Abstract

This paper is concerned with jointly recovering $n$ node-variables $\{x_i\}_{1\le i\le n}$ from a collection of
pairwise difference measurements. Imagine we acquire a few observations taking the form of $x_i - x_j$; the
observation pattern is represented by a measurement graph $\mathcal{G}$ with an edge set $\mathcal{E}$ such that $x_i - x_j$ is
observed if and only if $(i,j)\in\mathcal{E}$. To account for noisy measurements in a general manner, we model the
data acquisition process by a set of channels with given input/output transition measures. Employing
information-theoretic tools applied to channel decoding problems, we develop a unified framework to
characterize the fundamental recovery criterion, which accommodates general graph structures, alphabet
sizes, and channel transition measures. In particular, our results isolate a family of minimum channel
divergence measures to characterize the degree of measurement corruption, which together with the size
of the minimum cut of $\mathcal{G}$ dictates the feasibility of exact information recovery. For various homogeneous
graphs, the recovery condition depends almost only on the edge sparsity of the measurement graph
irrespective of other graphical metrics; alternatively, the minimum sample complexity required for these
graphs scales like

$$\text{minimum sample complexity} \asymp \frac{n\log n}{\mathrm{Hel}_{1/2}^{\min}}$$

for a certain information metric $\mathrm{Hel}_{1/2}^{\min}$ defined in the main text, as long as the alphabet size is not
super-polynomial in $n$. We apply our general theory to three concrete applications, including the stochastic
block model, the outlier model, and the haplotype assembly problem. Our theory leads to order-wise
tight recovery conditions for all these scenarios.

Index Terms: pairwise difference, information divergence, random graphs, geometric graphs, homogeneous
graphs


1 Introduction 

In various data processing scenarios, one wishes to acquire information about a large collection of objects, 
but it is infeasible or difficult to directly measure each individual object in isolation. Instead, only certain 
pairwise relations over a few object pairs can be measured. Partial examples of pairwise relations include 
cluster agreements, relative rotation and translation, pairwise matches, and paired sequencing reads, as will
be discussed in detail later. Taken collectively, these pairwise observations often carry a substantial amount
of information across all objects of interest. As a consequence, reliable joint information recovery becomes 
feasible as soon as a sufficiently large number of pairwise measurements are obtained. 

This paper explores a large family of pairwise measurements, which we term pairwise difference
measurements. Consider $n$ variables $x_1,\cdots,x_n$, and imagine we obtain independent measurements of the difference$^1$
$x_i - x_j$ over a few pairs $(i,j)$. This pairwise difference functional is represented by a measurement graph $\mathcal{G}$
with an edge set $\mathcal{E}$ such that $x_i - x_j$ is observed if and only if $(i,j)\in\mathcal{E}$. To accommodate the noisy nature of
data acquisition in a general manner, we model the observations $\{y_{ij}\}$ as the output of the following channel:

$$x_i - x_j \xrightarrow{\; p(y_{ij} \mid x_i - x_j) \;} y_{ij}, \qquad \forall (i,j)\in\mathcal{E}, \tag{1}$$

*Y. Chen is with the Department of Statistics, Stanford University, Stanford, CA 94305, USA (email: yxchen@stanford.edu). 
C. Suh is with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305- 
701, Korea (e-mail: chsuh@kaist.ac.kr). A. J. Goldsmith is with the Department of Electrical Engineering, Stanford University, 
Stanford, CA 94305, USA (email: andrea@wsl.stanford.edu). This paper has been presented in part at 
1 Here, “−” represents some algebraic subtraction operation (broadly defined), as we detail later in the paper.
Figure 1: Measurement graph and equivalent channel model. For each edge $(i,j)$ in the measurement graph,
$x_i - x_j$ is independently passed through a channel with output $y_{ij}$ and transition probability $p(y_{ij} \mid x_i - x_j)$.


as illustrated in Fig. 1. Here, the output distribution is specified solely by the associated channel input
$x_i - x_j$, with $p(\cdot \mid \cdot)$ representing the channel transition probability. The goal is to recover $x = \{x_1,\cdots,x_n\}$
based on these channel outputs $\{y_{ij}\}$. Note that for any connected graph $\mathcal{G}$, the ground truth $x$ is uniquely
determined by the pairwise difference functional $\{x_i - x_j \mid (i,j)\in\mathcal{E}\}$, up to some global offset. Therefore,
the problem can alternatively be posed as decoding the input of the channel (1) based on $\{y_{ij}\}$.

Problems of this kind have received considerable attention across various fields like social networks, 
computer science, and computational biology. A small sample of them are listed as follows. 

• Community detection and graph partitioning. Various real-world networks exhibit community
structures [29], and the nodes are grouped into a few clusters based on shared features. The aim is to
uncover the hidden community structure by observing the similarities between members. For instance,
in the simplest two-community model, the vertex-variables represent the community assignment, and
the edge variables encode whether two vertices belong to the same community. This two-community
problem, sometimes referred to as graph partitioning (e.g. [1, 15]), is a special instance of the pairwise
difference model.


• Alignment, registration and synchronization. Consider $n$ views of a single scene from different angles
and positions.$^2$ One is allowed to estimate the relative translation / rotation across several pairs of
views. The problem aims at simultaneously aligning all views based on these noisy pairwise
estimates. This arises in many applications including structure from motion in computer vision [21],
spectroscopy imaging and structural biology [23, 59], and multi-reference alignment [7].

• Joint matching. Given $n$ images / shapes representing the same physical object, one wishes to identify
common features across them. The input to a cutting-edge joint matching paradigm is typically a set
of noisy pairwise matches computed between several pairs of images in isolation [14, 36, 49, 60, 62], which
falls under the category of pairwise difference measurements. The goal is to recover globally consistent
maps across the features of all images, by refining these noisy pairwise inputs. This problem arises in
numerous applications in computer vision and graphics, solving jigsaw puzzles, etc.




• Genome assembly. The genomes of two unrelated people mostly differ at specific nucleotide positions
called single nucleotide polymorphisms (SNPs). A haplotype is a collection of associated SNPs on
a chromatid, which is important in understanding genetic causes of various diseases and developing
personalized medicine. Among various sequencing methods, haplotype assembly is particularly
effective from paired sequencing reads [8, 25, 35], which amounts to reconstructing the haplotype based


2 In a variety of applications including structure from motion and cryo-EM, these views (e.g. photos of some architectures, 
or projected images of 3D molecules) are given to us without revealing their absolute camera poses / angles with respect to the 
3D structure of interest. 
on disagreement between pairs of single reads [TF, 39, 55], a special instance of pairwise difference
measurement with binary alphabet.


Many of these practical applications have witnessed a flurry of recent activity in algorithm development,
which is primarily motivated by computational considerations. For instance, inspired by recent success
in spectral methods [40, 41] and convex relaxation [9, 11] (particularly those developed for low-rank matrix
recovery problems), many provably efficient algorithms have been proposed for graph clustering [1], joint
matching [14], synchronization [59], and so on. While these algorithms have been shown to enjoy intriguing
recovery guarantees under simple randomized models, the choices of performance metrics have mainly been
studied in a model-specific manner. On the fundamental-limit side, there have been several results in place
for a few applications, e.g. stochastic block models [28, 48], synchronization [13], and haplotype assembly.
Despite their intrinsic connections, these results were developed primarily on a case-by-case basis
instead of accounting for the most general observation models.

In the present paper, we emphasize the similarities and connections among all these motivating
applications, by viewing them as a graph-based functional fed into a collection of general channels. We wish to
explore the following questions from an information-theoretic perspective: 


1. Are there any distance metrics of the channel transition measures and graphical properties that dictate 
the success of exact information recovery from pairwise difference measurements? 

2. If so, can we characterize the interplay between these channel separation metrics and graphical
constraints and provide insights into the feasibility of simultaneous recovery?


All in all, the aim of this work is to gain a unified understanding about the performance limits that underlie 
various applications falling in the realm of pairwise-measurement based recovery. In turn, these fundamental 
criteria will provide a general benchmark for algorithm evaluation and comparison. 


1.1 Main Contributions 


The main contribution of this paper is towards a unified characterization of the fundamental information
recovery criterion, using both information-theoretic and graph-theoretic tools. In particular, we single out
and emphasize a family of minimum channel separation measures (i.e. the minimum Kullback-Leibler (KL),
Hellinger, and Rényi divergence), as well as two graphical metrics (i.e. the minimum cut size and the
cut-homogeneity exponent defined in Section 4.1), that play central roles in determining the feasibility of exact
recovery. Equipped with these metrics, we develop a sufficient and a necessary condition for information
recovery, which apply to general graphs, any type of input alphabets, and general channel transition measures.
Encouragingly, as long as the alphabet size is not super-polynomial in $n$, these two conditions coincide
(modulo some explicit universal constant) for the broad class of homogeneous graphs, subsuming as special
cases Erdős-Rényi models, homogeneous geometric graphs (e.g. generalized rings and grids), and many other
expander graphs.

In a nutshell, the fundamental recovery criterion is specified by the product of the minimum channel 
divergence measures and the size of the minimum cut. Intuitively, this product characterizes the amount 
of information one has available to differentiate two minimally separated input hypotheses. Somewhat 
surprisingly, for a variety of homogeneous graphs, the recovery criterion relies only on the edge sparsity of 
the measurement graph. Equivalently, the minimum sample complexity required for exact recovery in these 
homogeneous graphs scales as 


$$\text{minimum sample complexity} \asymp \frac{n\log n}{\mathrm{Hel}_{1/2}^{\min}}$$

for some information metric $\mathrm{Hel}_{1/2}^{\min}$ to be specified later, provided that the alphabet size is polynomial in $n$.
This result holds irrespective of other second-order graphical metrics like the spectral gap. 

The unified framework we develop is non-asymptotic, in the sense that it accommodates the most general
settings without fixing either the alphabet size or channel transition probabilities. This allows full
characterization of the high-dimensional regime where all parameters are allowed to scale (possibly with different
rates), a setting that has received increasing attention compared to the classical asymptotics where only
$n$ tends to infinity.

Finally, to illustrate the effectiveness of our general theory, we develop concrete consequences for three 
canonical applications that have been investigated in prior literature, including the stochastic block model, 
the outlier model, and the haplotype assembly problem. In each case, our theory recovers order-wise correct 
recovery guarantees, and even strengthens existing results in certain regimes. 


1.2 Related Work 


On the information-theoretic side, most prior works focused on binary input and output alphabets. Among
them, Abbe et al. [2] characterized the orderwise information-theoretic limits under the Erdős-Rényi model,
uncovering the intriguing observation that a decoding method based on convex relaxation achieves
nearly-optimal recovery guarantees under sparsely connected graphs. In addition, Si et al. [55] and Kamath et al. [39]
determined the information-theoretic limits for a similar setup motivated from genome sequencing, which
correspond to random graphs and (generalized) ring graphs, respectively. A sufficient recovery condition
for general graphs has also been derived in [2], although it was not guaranteed to be order optimal. Our
preliminary work [13] explored the fundamental recovery limits under general alphabets and graph structures,
but was restricted to the simplistic outlier model rather than general channel distributions. In contrast, the
framework developed in the current work allows orderwise tight characterization of the recovery criterion for
general alphabets and channel characteristics.

The pairwise measurement models considered in this paper and the aforementioned works [2, 13, 39, 55]
can all be treated as a special type of “graphical channel” as coined by Abbe and Montanari [3, 4], which
refers to a general family of channels whose transition probabilities factorize over a set of hyper-edges. 
This previous work on graphical channels centered on the metric of conditional entropy that quantifies the 
residual input uncertainty given the channel output, and uncovered the stability and concentration of this 
metric under random sparse graphs. In comparison, the present paper primarily aims to investigate how the 
channel transition measures affect the recovery limits in the absence of channel coding, which was previously 
out of reach. Specifically, the information limit under optimal channel coding is determined by the mutual 
information metric; in contrast, the information limit without channel coding is often dictated by certain 
minimum divergence metrics, which could sometimes be much smaller than the mutual information. This 
arises because optimal encoding enables us to code against the channel variation by maximizing the output 
separation between distinct input hypotheses, while in the non-coding applications one has to deal with the 
minimally separated input hypotheses determined by the practical applications. In addition, we focus on full 
recovery in this work, but in some applications this might be too stringent. Recent interesting work [31, 32]
explored the notion of partial recovery under binary alphabets, which highlighted the two-dimensional grids 
and supplied a two-step polynomial-time recovery algorithm. A more general theory regarding partial or 
approximate recovery is left for future work. 

Finally, the input variables $\{x_i\}$ can be viewed as discrete signals on the graph $\mathcal{G}$. Recent years have
seen much activity regarding discrete signal processing on graphs [52, 54]. For instance, it has been studied
in [12] how to optimally subsample band-limited graph signals, subject to a sampling rate constraint, while
enabling perfect signal recovery. Our model differs from this line of work in that the samples we take are
highly constrained (that is, we only allow pairwise difference samples taken over the edges), and hence the
resulting sample complexity significantly exceeds the sampling rate limit.


1.3 Terminology and Notation 

Graph terminology. Let $\deg(v)$ represent the degree of a vertex $v$. For any two vertex sets $\mathcal{S}_1$ and $\mathcal{S}_2$,
denote by $\mathcal{E}(\mathcal{S}_1,\mathcal{S}_2)$ (resp. $e(\mathcal{S}_1,\mathcal{S}_2)$) the set (resp. the number) of edges with exactly one endpoint in $\mathcal{S}_1$ and
another in $\mathcal{S}_2$. A complete graph of $n$ vertices, denoted by $K_n$, is a graph in which every pair of vertices is
connected by an edge. Below we introduce several widely used (random) graph models; see [27, 50] and the
references therein for in-depth discussion.


1. Erdős-Rényi graph. An Erdős-Rényi graph of $n$ vertices, denoted by $\mathcal{G}_{n,p}$, is constructed in such a way
that each pair of vertices is independently connected by an edge with probability $p$.

2. Random geometric graph. A random geometric graph, denoted by $\mathcal{G}_{n,r}$, is generated via a 2-step
procedure: (i) place $n$ vertices uniformly and independently on the surface of a unit sphere;$^3$ (ii)
connect two vertices by an edge if the Euclidean distance between them is at most $r$.

3. Expander graph. A graph $\mathcal{G}$ is said to be an expander graph with edge expansion $h_{\mathcal{G}}$ if $e(\mathcal{S},\mathcal{S}^c) \ge h_{\mathcal{G}}\,|\mathcal{S}|$
for all vertex sets $\mathcal{S}$ satisfying $|\mathcal{S}| \le n/2$.
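
As a small illustration of this terminology, the following Python sketch samples the edge set of an Erdős-Rényi graph and evaluates the cut size $e(\mathcal{S},\mathcal{S}^c)$ together with the largest admissible edge expansion by brute force. The function names are hypothetical choices made for this sketch, and the exhaustive search is only feasible for toy graphs.

```python
import itertools
import random


def erdos_renyi_edges(n, p, seed=0):
    """Sample the edge set of G(n, p): each pair is kept independently w.p. p."""
    rng = random.Random(seed)
    return {(i, j) for i, j in itertools.combinations(range(n), 2)
            if rng.random() < p}


def cut_size(edges, subset):
    """e(S, S^c): number of edges with exactly one endpoint in `subset`."""
    s = set(subset)
    return sum((i in s) != (j in s) for i, j in edges)


def edge_expansion(n, edges):
    """Brute-force minimum of e(S, S^c) / |S| over nonempty S with |S| <= n/2.

    Exponential in n, so this is only meant for toy graphs.
    """
    best = float("inf")
    for size in range(1, n // 2 + 1):
        for subset in itertools.combinations(range(n), size):
            best = min(best, cut_size(edges, subset) / size)
    return best


# Complete graph K_6: a vertex set of size s is cut by s * (6 - s) edges.
k6 = set(itertools.combinations(range(6), 2))
print(cut_size(k6, {0, 1}))    # 2 * 4 = 8
print(edge_expansion(6, k6))   # min over s <= 3 of (6 - s) = 3.0
```

For $K_n$ the ratio $e(\mathcal{S},\mathcal{S}^c)/|\mathcal{S}|$ equals $n-|\mathcal{S}|$, so the minimum is attained by the largest admissible sets.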

Divergence measures. Our results are established upon a family of divergence measures. Formally, for any
two probability measures $P$ and $Q$, if $P$ is absolutely continuous with respect to $Q$, then the KL divergence
of $Q$ from $P$ is defined as

$$\mathrm{KL}(P\,\|\,Q) := \int dP \,\log\Big(\frac{dP}{dQ}\Big), \tag{2}$$

whereas the Hellinger divergence of order $\alpha\in(0,1)$ of $Q$ from $P$ is defined to be [44, 56]

$$\mathrm{Hel}_\alpha(P\,\|\,Q) := \frac{1}{1-\alpha}\Big[1 - \int (dP)^{\alpha}(dQ)^{1-\alpha}\Big]. \tag{3}$$

When $\alpha = 1/2$, this reduces to the so-called squared Hellinger distance$^4$

$$\mathrm{Hel}_{1/2}(P\,\|\,Q) = 2 - 2\int \sqrt{dP}\sqrt{dQ} = \int \big(\sqrt{dP} - \sqrt{dQ}\big)^2. \tag{4}$$

The $\chi^2$ divergence is defined as

$$\chi^2(P\,\|\,Q) := \int \Big(\frac{dP}{dQ} - 1\Big)^2 dQ. \tag{5}$$

In particular, when $P = \mathrm{Bernoulli}(p)$ and $Q = \mathrm{Bernoulli}(q)$, we abuse the notation and let

$$\mathrm{KL}(p\,\|\,q) = \mathrm{KL}(P\,\|\,Q), \quad \mathrm{Hel}_\alpha(p\,\|\,q) = \mathrm{Hel}_\alpha(P\,\|\,Q), \quad\text{and}\quad \chi^2(p\,\|\,q) = \chi^2(P\,\|\,Q). \tag{6}$$


More generally, the $f$-divergence of $Q$ from $P$ is defined as

$$D_f(P\,\|\,Q) := \int f\Big(\frac{dP}{dQ}\Big)\, dQ \tag{7}$$

for any convex function $f(\cdot)$ such that $f(1) = 0$ [44, 56]. Note that the Hellinger divergence of order $\alpha$, the
KL divergence, and the $\chi^2$ divergence are special cases of $f$-divergence generated by $f(x) = \frac{1}{1-\alpha}(1 - x^{\alpha})$,
$f(x) = x\log x$ (or $f(x) = x\log x - x + 1$), and $f(x) = (x-1)^2$, respectively. These divergence measures can
often be efficiently estimated even under large alphabets; see, e.g., [38] and their subsequent work.

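Finally, we introduce the Rényi divergence of positive order $\alpha$, where $\alpha \neq 1$, of a distribution $P$ from
another distribution $Q$ as [51, 58]

$$D_\alpha(P\,\|\,Q) := \frac{1}{\alpha-1}\log\Big(\int (dP)^{\alpha}(dQ)^{1-\alpha}\Big) \tag{8}$$

$$\phantom{D_\alpha(P\,\|\,Q) :}= -\frac{1}{1-\alpha}\log\big(1 - (1-\alpha)\,\mathrm{Hel}_\alpha(P\,\|\,Q)\big). \tag{9}$$

It follows from the elementary inequality $1 - x \le e^{-x}$ that $D_\alpha(P\,\|\,Q) \ge \mathrm{Hel}_\alpha(P\,\|\,Q)$. This together with
the monotonicity of $D_\alpha$ [58, Theorem 3] gives

$$\mathrm{Hel}_\alpha(P\,\|\,Q) \le D_\alpha(P\,\|\,Q) \le \mathrm{KL}(P\,\|\,Q), \qquad 0 < \alpha < 1. \tag{10}$$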
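
For distributions on a finite alphabet, the integrals above reduce to sums, and the relations (9) and (10) can be checked numerically. The Python sketch below is one way to do so; the function names and the toy distributions are hypothetical, and the second argument is assumed to have strictly positive entries so that absolute continuity holds.

```python
import math


def kl(p, q):
    """KL(P || Q) = sum_y p(y) log(p(y) / q(y)), natural logarithm."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def hellinger(p, q, alpha=0.5):
    """Hel_alpha(P || Q) = (1 - sum_y p^alpha q^(1 - alpha)) / (1 - alpha)."""
    affinity = sum(a ** alpha * b ** (1 - alpha) for a, b in zip(p, q))
    return (1.0 - affinity) / (1.0 - alpha)


def renyi(p, q, alpha=0.5):
    """D_alpha(P || Q) = log(sum_y p^alpha q^(1 - alpha)) / (alpha - 1)."""
    affinity = sum(a ** alpha * b ** (1 - alpha) for a, b in zip(p, q))
    return math.log(affinity) / (alpha - 1.0)


def chi2(p, q):
    """chi^2(P || Q) = sum_y (p(y) / q(y) - 1)^2 q(y)."""
    return sum((a / b - 1.0) ** 2 * b for a, b in zip(p, q))


p, q = [0.3, 0.7], [0.4, 0.6]   # toy Bernoulli pair
for alpha in (0.25, 0.5, 0.75):
    h, d = hellinger(p, q, alpha), renyi(p, q, alpha)
    # Relation (9) between the Renyi and Hellinger divergences.
    assert abs(d + math.log(1 - (1 - alpha) * h) / (1 - alpha)) < 1e-12
    # Sandwich (10).
    assert h <= d <= kl(p, q)
```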


Other notation. Let $1$ and $0$ be the all-one and all-zero vectors, respectively. We denote by $\mathrm{supp}(x)$
(resp. $\|x\|_0$) the support (resp. the support size) of $x$. The standard notation $f(n) = o(g(n))$ means
$\lim_{n\to\infty} f(n)/g(n) = 0$; $f(n) = \omega(g(n))$ means $\lim_{n\to\infty} g(n)/f(n) = 0$; $f(n) = \Omega(g(n))$ or $f(n) \gtrsim g(n)$ mean
there exists a constant $c$ such that $f(n) \ge c\,g(n)$; $f(n) = O(g(n))$ or $f(n) \lesssim g(n)$ mean there exists a
constant $c$ such that $f(n) \le c\,g(n)$; $f(n) = \Theta(g(n))$ or $f(n) \asymp g(n)$ mean there exist constants $c_1$ and $c_2$
such that $c_1 g(n) \le f(n) \le c_2 g(n)$. Throughout this paper, $\log(\cdot)$ represents the natural logarithm.

3 We consider $\mathcal{G}_{n,r}$ on a unit sphere instead of $[0,1]^2$ to eliminate edge effects.

4 Several other sources introduce a prefactor of 1/2 in order to normalize the squared Hellinger distance, resulting in the
definition $\int \frac{1}{2}(\sqrt{dP} - \sqrt{dQ})^2$. Here, we adopt the unnormalized version as given in [57, Section 2.4].
1.4 Organization

The remainder of the paper is organized as follows. In Section 2, we describe the formal problem setup and
introduce the key channel distance measures. We develop non-asymptotic sufficient and necessary recovery
conditions for the special Erdős-Rényi model in Section 3, along with some intuitive interpretation of the
results. Section 4 presents the recovery conditions in full generality, which accommodate general alphabets,
graph structures, and channel characteristics, with particular emphasis on the family of homogeneous graphs.
To illustrate the effectiveness of our framework, we apply our general theory to a few concrete examples in
Section 5. Section 6 concludes the paper with a summary of our findings and a discussion of future directions.
The proofs of the main results and auxiliary lemmas are deferred to the appendices.

2 Problem Formulation and Key Metrics 

2.1 Models 

Imagine a collection of $n$ vertices $\mathcal{V} = \{1,\cdots,n\}$, each represented by a vertex-variable $x_i$ over the input
alphabet $\mathcal{X} := \{0,1,\cdots,M-1\}$, where $M$ represents the alphabet size.

• Object representation and pairwise difference. Consider an additive group formed over $\mathcal{X}$
together with an associative addition operation “+” (broadly defined). For any $x_i, x_j\in\mathcal{X}$, the pairwise
difference operation “−” is defined as

$$x_i - x_j := x_i + (-x_j), \tag{11}$$

where $-x$ stands for the unique additive inverse of $x$. We assume throughout that “+” satisfies the
following bijective property:

$$\forall x_i\in\mathcal{X}: \qquad \begin{cases} x_i + x_j \neq x_i + x_l, & \forall x_l \neq x_j; \\ x_i + x_j \neq x_l + x_j, & \forall x_l \neq x_i. \end{cases} \tag{12}$$

A partial list of examples includes:


1. Modular arithmetic: if we define “+” to be the modular addition over integers $\{0,1,\cdots,M-1\}$,
then $x_i - x_j \pmod M$ is a valid example of the pairwise difference operation.

2. Relative rotation: set $x_i = R_i$ for some rotation matrix $R_i$ and let “+” denote matrix
multiplication. Then $x_i - x_j$ stands for $R_i R_j^{-1}$, which represents the relative rotation between $i$ and $j$,
and hence is a special case of the pairwise difference operation.

3. Pairwise map: if we set $x_i$ to be some permutation matrix $\Pi_i$ and let “+” be matrix multiplication,
then the pairwise map between two isomorphic sets, captured by $\Pi_i \Pi_j^{-1}$, also belongs to the
pairwise difference model.


• Measurement graph and channel model. The measurement pattern is represented by a
measurement graph $\mathcal{G}$ that comprises an undirected edge set $\mathcal{E}$, so that $x_i - x_j$ is measured if and only
if $(i,j)\in\mathcal{E}$. As illustrated in Fig. 1, for each $(i,j)\in\mathcal{E}$ ($i > j$), the pairwise difference $x_i - x_j$ is
independently passed through a channel, whose output $y_{ij}$ follows the conditional distribution

$$\mathbb{P}(y_{ij} \mid x_i - x_j = l) = \mathbb{P}_l(y_{ij}), \qquad 0 \le l < M. \tag{13}$$

Here, $\mathbb{P}_l(\cdot)$ denotes the transition measure that maps a given input $l$ to the output alphabet $\mathcal{Y}$; see Fig. 2
for an illustration. With a slight abuse of notation, we let $\mathbb{P}_l = \mathbb{P}_{l \bmod M}$ for any $l \notin \{0,\cdots,M-1\}$.

We assume throughout that the observations are symmetric,$^5$ in the sense that there exists a one-to-one
mapping between $y_{ij}$ and $y_{ji}$ for any $(i,j)\in\mathcal{E}$; that is, all information is contained in the
5 We assume the observation model is symmetric because this is the case in all motivating applications listed in the Introduction.
We note, however, that all results and analyses immediately extend to the asymmetric case, provided that $\mathbb{P}_l(\cdot)$ is defined
with respect to $(y_{ij}, y_{ji})$, that is, $\mathbb{P}_l(y_{ij}, y_{ji}) := \mathbb{P}(y_{ij}, y_{ji} \mid x_i - x_j = l)$.
Figure 2: The probability measure $\mathbb{P}_l(\cdot)$ is defined to be the distribution of $y_{ij}$ given $x_i - x_j = l$.

upper triangular part. The output alphabet $\mathcal{Y}$ can be either continuous or discrete, finite
or infinite, which allows general modeling of distortion, corruption, etc. As opposed to conventional
information theory settings, no coding is employed across channel uses.

This paper centers on exact information recovery, that is, to reconstruct all input variables $x = \{x_1,\cdots,x_n\}$
precisely, except for some global offset. This is all one can hope for since there is absolutely no basis to
distinguish $x$ from its shifted version $x + l\cdot 1 = \{x_1 + l,\cdots,x_n + l\}$ given only the output $y := \{y_{ij} \mid (i,j)\in\mathcal{E}\}$.
In light of this, we introduce the zero-one distance modulo a global offset factor as follows

$$\mathrm{dist}(w, x) := 1 - \max_{0\le l<M} \mathbb{I}\{w = x + l\cdot 1\}, \tag{14}$$

where $\mathbb{I}$ is the indicator function. Clearly, $\mathrm{dist}(w,x) = 0$ holds for all $w$ that differ from $x$ only by a
global offset. With this metric in place, we define, for any recovery procedure $\psi$, the probability
of error as

$$P_e(\psi) := \max_{x\in\mathcal{X}^n} \mathbb{P}\{\mathrm{dist}(\psi(y), x) \neq 0 \mid x\}. \tag{15}$$

The aim is to characterize the regime where the minimax probability of error $\inf_\psi P_e(\psi)$ is vanishing.
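
The observation model and the distance (14) can be sketched in a few lines of Python for the modular-arithmetic case with a finite output alphabet. The names `pairwise_differences`, `observe` and `dist`, as well as the representation of the channel as a list of probability vectors, are assumptions made for this illustration only.

```python
import random


def pairwise_differences(x, edges, M):
    """Noiseless channel inputs x_i - x_j (mod M) for each measured pair."""
    return {(i, j): (x[i] - x[j]) % M for i, j in edges}


def observe(x, edges, M, channel, seed=0):
    """Pass each difference through the channel independently.

    channel[l] is the output distribution P_l, given as a list of
    probabilities over the finite output alphabet {0, ..., len(channel[l]) - 1}.
    """
    rng = random.Random(seed)
    diffs = pairwise_differences(x, edges, M)
    return {e: rng.choices(range(len(channel[l])), weights=channel[l])[0]
            for e, l in diffs.items()}


def dist(w, x, M):
    """Zero-one distance modulo a global offset, as in (14)."""
    # w = x + l * 1 for some l iff all entrywise offsets (mod M) coincide.
    offsets = {(wi - xi) % M for wi, xi in zip(w, x)}
    return 0 if len(offsets) <= 1 else 1


x = [0, 2, 1, 2]
edges = [(1, 0), (2, 1), (3, 0)]
shifted = [(xi + 2) % 3 for xi in x]
# A global offset changes neither the differences nor the distance.
assert pairwise_differences(x, edges, 3) == pairwise_differences(shifted, edges, 3)
assert dist(shifted, x, 3) == 0 and dist([1, 2, 1, 2], x, 3) == 1
```

The two assertions at the end reflect the point made above: the shifted vector produces exactly the same channel inputs, so no decoder can tell it apart from $x$.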

2.2 Key Separation Metrics on Channel Transition Measures 

Before proceeding to the main results, we introduce a few channel separation measures that capture the
resolutions of the measurements, which will be critical in subsequent development of our theory. Specifically,
we isolate the minimum KL, Hellinger, and Rényi divergence with respect to the channel transition measures
as follows:$^6$

$$\mathrm{KL}^{\min} := \min_{l\neq k} \mathrm{KL}(\mathbb{P}_l \,\|\, \mathbb{P}_k); \tag{16}$$

$$\mathrm{Hel}_\alpha^{\min} := \min_{l\neq k} \mathrm{Hel}_\alpha(\mathbb{P}_l \,\|\, \mathbb{P}_k); \tag{17}$$

$$D_\alpha^{\min} := \min_{l\neq k} D_\alpha(\mathbb{P}_l \,\|\, \mathbb{P}_k) = -\frac{1}{1-\alpha}\log\big(1 - (1-\alpha)\,\mathrm{Hel}_\alpha^{\min}\big). \tag{18}$$
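
For a finite output alphabet, the three minima in (16)-(18) can be computed by enumerating ordered input pairs. The following Python sketch does this for a hypothetical $M = 3$ channel; the function names and the channel itself are illustrative assumptions, and all transition probabilities are taken to be positive.

```python
import itertools
import math


def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def hellinger(p, q, alpha):
    affinity = sum(a ** alpha * b ** (1 - alpha) for a, b in zip(p, q))
    return (1.0 - affinity) / (1.0 - alpha)


def min_divergences(channel, alpha=0.5):
    """Return (KL^min, Hel_alpha^min, D_alpha^min) over ordered pairs l != k.

    channel[l] lists the probabilities of P_l on a common finite output
    alphabet; all entries are assumed positive (absolute continuity).
    """
    pairs = list(itertools.permutations(range(len(channel)), 2))
    kl_min = min(kl(channel[l], channel[k]) for l, k in pairs)
    hel_min = min(hellinger(channel[l], channel[k], alpha) for l, k in pairs)
    # (18): the Renyi minimum follows from the Hellinger minimum.
    d_min = -math.log(1.0 - (1.0 - alpha) * hel_min) / (1.0 - alpha)
    return kl_min, hel_min, d_min


# Hypothetical channel: the correct difference is reported w.p. 0.8,
# and each of the two wrong values w.p. 0.1.
channel = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
print(min_divergences(channel))
```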


These minimum divergence measures essentially reflect the distinguishability of channel outputs given
minimally separated inputs.$^7$ As will be seen later, the minimum Hellinger and Rényi divergence are crucial
in developing sufficient recovery conditions, while the minimum KL divergence plays an important role in
deriving minimax lower bounds. It is well known (see [44, 53, 56, 57] for various inequalities connecting them)
that these measures are almost equivalent (modulo some small constant) when any two probability measures
6 Here and throughout, we assume that $\mathbb{P}_l$ is absolutely continuous with respect to $\mathbb{P}_k$ for any $l$ and $k$.

7 One natural question arises as to how to estimate such divergence metrics from measured data, which has become an
active research topic. When both the input and output alphabet sizes are small, one can first estimate the entire probability
measures via low-rank matrix recovery schemes (e.g. [37]), and then plug them in to calculate the divergence metrics. When the
alphabet size is large or when the output is continuous-valued, one might resort to more careful functional estimation algorithms
(e.g. [38, 42]).
under study are close to each other, a regime where two measures are the hardest to differentiate. In
particular, we underscore one fact that links the KL divergence and the squared Hellinger distance, which we
shall use several times in the rest of the paper; see [26, Proposition 2] for an alternative version.

Fact 1. Suppose that $P$ and $Q$ are two probability measures such that

$$\frac{dP}{dQ} \le R \quad\text{and}\quad \frac{dQ}{dP} \le R$$

hold uniformly over the probability space. Then, one has

$$\max\{2 - 0.5\log R,\ 1\}\cdot \mathrm{Hel}_{1/2}(P\,\|\,Q) \le \mathrm{KL}(P\,\|\,Q) \le (2 + \log R)\cdot \mathrm{Hel}_{1/2}(P\,\|\,Q). \tag{19}$$

Furthermore, if $R < 4.5$, then one has

$$(2 - 0.4\log R)\cdot \mathrm{Hel}_{1/2}(P\,\|\,Q) \le \mathrm{KL}(P\,\|\,Q) \le (2 + 0.4\log R)\cdot \mathrm{Hel}_{1/2}(P\,\|\,Q). \tag{20}$$

Proof. See Appendix H. □
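
As a numerical spot check (not a proof), the bounds in (19) can be evaluated for a pair of discrete distributions, taking $R$ to be the largest likelihood ratio in either direction. The Python sketch below uses hypothetical function names and a toy Bernoulli pair chosen only for illustration.

```python
import math


def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def sq_hellinger(p, q):
    """Unnormalized squared Hellinger distance Hel_{1/2}(P || Q)."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def fact1_bounds(p, q):
    """Return (lower, KL, upper) from (19) with the smallest admissible R.

    Both distributions are assumed to have strictly positive entries.
    """
    R = max(max(a / b, b / a) for a, b in zip(p, q))
    h = sq_hellinger(p, q)
    lower = max(2.0 - 0.5 * math.log(R), 1.0) * h
    upper = (2.0 + math.log(R)) * h
    return lower, kl(p, q), upper


lower, value, upper = fact1_bounds([0.3, 0.7], [0.4, 0.6])
print(lower <= value <= upper)
```

In this example the two measures are close ($R = 4/3$), and the ratio $\mathrm{KL}/\mathrm{Hel}_{1/2}$ is close to 2, in line with the near-equivalence discussed above.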

We conclude this part with another quantity that will often prove useful in tightening our results.
Specifically, for any $\zeta > 0$, we define

$$m^{\mathrm{kl}}(\zeta) := \max_{l} \Big|\big\{\, i \neq l \;:\; \mathrm{KL}(\mathbb{P}_i \,\|\, \mathbb{P}_l) \le (1+\zeta)\,\mathrm{KL}^{\min} \big\}\Big|. \tag{21}$$

It is self-evident that $1 \le m^{\mathrm{kl}}(\zeta) < M$ holds regardless of $\zeta$. This quantity determines the number of distinct
input pairs under study that result in nearly-minimal output separation.
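
The counting in (21) is straightforward to carry out for a finite channel. The following Python sketch is one possible implementation under the reading that the maximum runs over the reference input $l$; the function name `m_kl` and both toy channels are hypothetical.

```python
import math


def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def m_kl(channel, zeta):
    """max over l of #{i != l : KL(P_i || P_l) <= (1 + zeta) * KL^min}."""
    M = len(channel)
    div = {(i, l): kl(channel[i], channel[l])
           for i in range(M) for l in range(M) if i != l}
    threshold = (1.0 + zeta) * min(div.values())
    return max(sum(div[i, l] <= threshold for i in range(M) if i != l)
               for l in range(M))


# Fully symmetric channel: every wrong input is equally close, so m_kl = M - 1.
symmetric = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
print(m_kl(symmetric, zeta=0.1))   # 2

# Inputs 0 and 1 are nearly indistinguishable, input 2 is far from both.
skewed = [[0.5, 0.3, 0.2], [0.45, 0.35, 0.2], [0.1, 0.1, 0.8]]
print(m_kl(skewed, zeta=0.1))      # 1
```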


3 Main Results: Erdős-Rényi Graphs

At an intuitive level, faithful decoding is feasible only when (i) the measurement graph $\mathcal{G}$ is sufficiently
connected so that we have enough measurements involving each vertex variable, and (ii) the channel output
distributions given any two distinct inputs are sufficiently separated and hence distinguishable. To develop a
more quantitative understanding about these two factors, we start with the Erdős-Rényi model, a tractable
yet widely adopted random graph model for numerous applications. Specifically, we suppose that
the measurement graph $\mathcal{G}$ is drawn from $\mathcal{G}_{n,p_{\mathrm{obs}}}$ for some edge probability $p_{\mathrm{obs}} \gtrsim \log n / n$. As will be shown
in Section 4, many properties and intuitions that we develop for this specific graph model hold in greater
generality.


3.1 Maximum Likelihood Decoding 

To begin with, we analyze the performance guarantees of the maximum likelihood (ML) decoder

$$\psi_{\mathrm{ml}}(y) := \arg\max_{x\in\mathcal{X}^n} \mathbb{P}\{y \mid x\}. \tag{22}$$

It is well-known that the ML rule minimizes the Bayesian probability of error under uniform input priors. 
We develop a sufficient recovery condition in terms of the edge probability and the minimum information 
divergence, which characterizes the tradeoff between the degree of graph connectivity and the resolution of 
channel outputs. 
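
For very small instances, the ML rule (22) can be evaluated by exhaustive search. The Python sketch below assumes the modular-arithmetic difference and a finite output alphabet, and fixes $x_0 = 0$ to remove the global-offset ambiguity; the function names and the toy instance are hypothetical, and the search cost of $M^{n-1}$ likelihood evaluations makes this an illustration rather than a practical decoder.

```python
import itertools
import math


def log_likelihood(x, y, M, channel):
    """log P{y | x}: sum over measured (i, j) of log P_l(y_ij), l = x_i - x_j."""
    total = 0.0
    for (i, j), out in y.items():
        prob = channel[(x[i] - x[j]) % M][out]
        if prob == 0.0:
            return -math.inf
        total += math.log(prob)
    return total


def ml_decode(y, n, M, channel):
    """Exhaustive ML search over hypotheses with x_0 = 0 (ties: first found)."""
    candidates = ((0,) + rest
                  for rest in itertools.product(range(M), repeat=n - 1))
    return max(candidates, key=lambda x: log_likelihood(x, y, M, channel))


# Hypothetical noiseless binary channel on the path graph 0 - 1 - 2 - 3.
channel = [[1.0, 0.0], [0.0, 1.0]]
truth = (0, 1, 1, 0)
y = {(i + 1, i): (truth[i + 1] - truth[i]) % 2 for i in range(3)}
print(ml_decode(y, 4, 2, channel))   # (0, 1, 1, 0)
```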

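Theorem 1. Fix $\delta > 0$, and suppose that $\mathcal{G} \sim \mathcal{G}_{n,p_{\mathrm{obs}}}$. Then there exist some universal constants $C, c_1 > 0$
such that if

$$\sup_{0<\alpha<1}\big\{(1-\alpha)\,\mathrm{Hel}_\alpha^{\min}\big\}\cdot (p_{\mathrm{obs}}\, n) > (1+\delta)\log(2n) + 2\log(M-1), \tag{23}$$

then the ML decoder $\psi_{\mathrm{ml}}$ obeys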

Pe (^ml) < 


(2 n ) max {f (5_ i <52 ’ V} _ 


+ 


>io 


- 1 


+ Cn 


—c\8n 
Figure 3: The pairwise inputs $[x_i - x_j]_{1\le i,j\le n}$ under a realization of $\mathcal{G}_{n,p_{\mathrm{obs}}}$ with $n = 8$ and $p_{\mathrm{obs}} = 0.3$ as shown
in (a). The input patterns are shown for (b) the ground truth $x = 0$, (c) the hypothesis $\tilde{x} = [1,0,\cdots,0]$, and
(d) the hypothesis $\tilde{x} = [0,\cdots,0,1]$. The blue parts represent the entries being measured, and the orange
region constitutes the parts of pairwise inputs that disagree with the ground truth.


Proof. See Appendix A. □

Theorem 1 essentially suggests that the ML rule is guaranteed to work with high probability as long as

$$\sup_{0<\alpha<1}\big\{(1-\alpha)\,\mathrm{Hel}_\alpha^{\min}\big\} > (1+o(1))\,\frac{\log n}{p_{\mathrm{obs}}\, n} + \frac{2\log M}{p_{\mathrm{obs}}\, n}.$$

Our result is non-asymptotic in the sense that it holds for all parameters $(n, M, \mathrm{Hel}_\alpha^{\min})$ instead of being limited
to the asymptotic regime with $n$ tending to infinity. Recognizing that $p_{\mathrm{obs}}\, n$ is essentially the average vertex
degree $d_{\mathrm{avg}}$, our recovery condition reads

$$\sup_{0<\alpha<1}\big\{(1-\alpha)\,\mathrm{Hel}_\alpha^{\min}\big\}\cdot d_{\mathrm{avg}} \gtrsim \log n, \tag{24}$$

provided that $M = O(\mathrm{poly}(n))$ and $\alpha\in(0,1)$ is some fixed constant independent of $n$.
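
To get a feel for the sufficient condition (23), one can evaluate both of its sides for a concrete finite channel, approximating the supremum over $\alpha$ by a grid search. The Python sketch below does this; the function name, the grid resolution, and the binary symmetric example are assumptions for illustration, and since a grid only lower-bounds the supremum, a failed check is not conclusive.

```python
import itertools
import math


def hellinger(p, q, alpha):
    affinity = sum(a ** alpha * b ** (1 - alpha) for a, b in zip(p, q))
    return (1.0 - affinity) / (1.0 - alpha)


def sufficient_condition(channel, n, p_obs, delta, grid=99):
    """Evaluate both sides of (23); the condition reads lhs > rhs.

    The sup over alpha in (0, 1) is approximated on a uniform grid.
    Requires an alphabet size M = len(channel) of at least 2.
    """
    M = len(channel)
    pairs = list(itertools.permutations(range(M), 2))
    best = 0.0
    for step in range(1, grid + 1):
        alpha = step / (grid + 1)
        hel_min = min(hellinger(channel[l], channel[k], alpha)
                      for l, k in pairs)
        best = max(best, (1.0 - alpha) * hel_min)
    lhs = best * p_obs * n
    rhs = (1.0 + delta) * math.log(2 * n) + 2.0 * math.log(M - 1)
    return lhs, rhs


# Hypothetical binary symmetric channel with flip probability 0.1.
bsc = [[0.9, 0.1], [0.1, 0.9]]
lhs, rhs = sufficient_condition(bsc, n=1000, p_obs=0.05, delta=0.1)
print(lhs > rhs)
```

For this symmetric channel the supremum is attained at $\alpha = 1/2$, where $(1-\alpha)\,\mathrm{Hel}_\alpha^{\min} = 1 - 2\sqrt{0.09} = 0.4$, so an average degree of 50 comfortably exceeds the right-hand side of roughly $1.1 \log 2000$.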

We pause to develop some intuitive understanding about the condition (24). In contrast to classical
information theory settings, the channel decoding model considered herein concerns “uncoded” channel input.
Consequently, the recovery bottleneck for the ML rule is presented by the minimum output distance given
two distinct hypotheses, rather than the mutual information that plays a crucial role in coded transmission.
To be more precise, two hypotheses $x$ and $\tilde{x}$ are the least separated when they differ only by one component,
say, $v$. As a concrete example, one can take $x = [0,\cdots,0]$ and $\tilde{x} = [1,0,\cdots,0]$ as illustrated in Fig. 3. The
resulting pairwise outputs $\{y_{ij} \mid (i,j)\in\mathcal{E}\}$ thus contain about $\deg(v)$ pieces of information for distinguishing
$x$ and $\tilde{x}$; see, e.g., the orange shaded region highlighted in Fig. 3. Since the information contained in each
measurement can be quantified by certain divergence metrics, namely, $\mathrm{Hel}_\alpha^{\min}$ (or $\mathrm{KL}^{\min}$ as adopted in Section
3.2), the total amount of information one has available to distinguish two minimally separated hypotheses
is captured by

$$\mathrm{Hel}_\alpha^{\min}\cdot d_{\mathrm{avg}} \quad \big(\text{or } \mathrm{KL}^{\min}\cdot d_{\mathrm{avg}}\big). \tag{25}$$

Furthermore, there are at least $n$ distinct hypotheses that are all minimally apart from the ground truth $x$
(e.g. $\tilde{x} = [1,0,\cdots,0]$, $\tilde{x} = [0,1,\cdots,0]$, $\cdots$, $\tilde{x} = [0,\cdots,0,1]$). Representation of these hypotheses calls for at
least $\log n$ bits, and hence the information that one can exploit to distinguish $x$ from them, i.e. (25), needs
to exceed $\log n$. This offers an intuitive interpretation of the recovery condition (24).

Careful readers will note that Theorem 1 is presented in terms of the Hellinger divergence rather than
the KL divergence. We remark on this technical matter as follows.

Remark 1. In general, we are unable to develop the recovery conditions in terms of the KL divergence. This
arises partly because the KL divergence cannot be well controlled for all measures, especially when
$\big\| d\mathbb{P}_l / d\mathbb{P}_j \big\|_\infty$ ($l \neq j$) grows. In contrast, the Hellinger divergence is generally stable and more convenient to analyze in
this case.
We conclude this part with an extension. Examining our analysis reveals that all arguments continue to hold even if the output distributions are location-dependent. Formally, suppose that the distribution of $y_{ij}$ is parametrized by

$$\mathbb{P}\left( y_{ij} \mid x_i - x_j = l \right) = P_l^{ij}(y_{ij}), \qquad 0 \le l < M, \ (i,j) \in \mathcal{E}. \tag{26}$$


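This leads to a modified version of the minimum divergence metric as follows:

$$\widetilde{\mathrm{Hel}}_{\alpha}^{\min} := \min \left\{ \mathrm{Hel}_{\alpha}\left( P_l^{ij} \,\|\, P_k^{ij} \right) \;\middle|\; l \neq k, \ 0 \le l, k < M, \ (i,j) \in \mathcal{E} \right\}. \tag{27}$$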


With these modified metrics in place, the preceding sufficient recovery condition immediately extends to this 
generalized model. 


Theorem 2. The recovery condition of Theorem 1 continues to hold under the transition probabilities (26) if $\mathrm{Hel}_{\alpha}^{\min}$ is replaced by the modified metric defined in (27).


3.2 Minimax Lower Bound 

In order to assess the tightness of our recovery guarantee for the ML rule, we develop two necessary conditions 
that apply to any recovery procedure. Here and below, H (x) := — xlogx — (1 — x) log (1 — x) stands for the 
binary entropy function. 

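Theorem 3. Suppose that $\mathcal{G} \sim \mathcal{G}_{n, p_{\mathrm{obs}}}$. Fix any $\zeta > 0$ and $\epsilon > 0$, and assume that $p_{\mathrm{obs}} > \frac{c \log n}{n}$ for some sufficiently large constant $c > 0$.

(a) If

$$\mathrm{KL}^{\min} \cdot p_{\mathrm{obs}} n < \frac{(1-\epsilon)\left( \log n + \log m^{\mathrm{kl}}(\zeta) \right) - H(\epsilon)}{(1+\epsilon)(1+\zeta)}, \tag{28}$$

then $\inf_{\psi} P_e(\psi) > \epsilon - n^{-10}$.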

(b) Suppose that a < and p 0 \> s n > 2eolog n. If 

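$$(1-\alpha)\, \mathrm{Hel}_{\alpha}^{\min} \cdot p_{\mathrm{obs}} n < \epsilon\alpha \log n - r_{\epsilon} \tag{29}$$

for some residual$^{8}$ $r_{\epsilon}$, then $\inf_{\psi} P_e(\psi) > n^{-\epsilon} - n^{-10}$.

Proof. See Appendices B and C. □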


We remark that the two necessary recovery conditions in Theorem 3 concern two regimes of separate interest. Specifically, Condition (28), based on the KL divergence, is most useful when investigating first-order convergence, namely, the situation where we only require the minimax probability of error to be asymptotically vanishing without specifying convergence rates. In comparison, Condition (29), based on the Hellinger distance, is more convenient when we further demand exact recovery to occur with polynomially high probability (e.g. $1 - 1/n$). In various "big-data" applications, the term "with high probability" might only refer to the case where the error probability decays at least at a polynomial rate.

On the other hand, while Condition (28) is not directly presented in terms of $M$, we can often capture the effect of the alphabet size through the surrogate $m^{\mathrm{kl}}$, provided that $\log m^{\mathrm{kl}} \asymp \log M$. In fact, this arises in many scenarios of interest. As an example, see the outlier model to be discussed in Section 5.2, where $m^{\mathrm{kl}} = M - 1$.


8 More precisely, r e := log 2 + 2 ^ £a log ^ 
3.3 Tightness of Theorems 1 and 3 and Minimal Sample Complexity

Encouragingly, the recovery conditions derived in Theorems 1 and 3 are often tight up to some small multiplicative constant. In the sequel, we will assume that $p_{\mathrm{obs}} \gtrsim \log n / n$, and will pay special attention to two of the most popular divergence metrics: the KL divergence and the squared Hellinger distance.

1. Consider the first-order convergence, that is, the regime where $\inf_{\psi} P_e(\psi) \to 0$ ($n \to \infty$). Combining Theorems 1 and 3(a) suggests that

$$\inf_{\psi} P_e(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } \mathrm{Hel}_{1/2}^{\min} \cdot p_{\mathrm{obs}} n > (1+o(1)) \left( 2\log n + 4\log M \right),$$

$$\inf_{\psi} P_e(\psi) \nrightarrow 0 \quad \text{if } \mathrm{KL}^{\min} \cdot p_{\mathrm{obs}} n < (1-o(1)) \left( \log n + \log m^{\mathrm{kl}} \right).$$

When applied to the most challenging case where $\frac{\mathrm{d}P_l}{\mathrm{d}P_j} = 1 + o(1)$ for all $l \neq j$, these conditions read (with the assistance of Fact 1)

$$\inf_{\psi} P_e(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } \mathrm{Hel}_{1/2}^{\min} \cdot p_{\mathrm{obs}} n > (1+o(1)) \left( 2\log n + 4\log M \right), \tag{30}$$

$$\inf_{\psi} P_e(\psi) \nrightarrow 0 \quad \text{if } \mathrm{Hel}_{1/2}^{\min} \cdot p_{\mathrm{obs}} n < (1-o(1)) \, \frac{\log n + \log m^{\mathrm{kl}}}{2}, \tag{31}$$

which are matching conditions modulo some multiplicative factor not exceeding

$$(1+o(1)) \, \frac{4\log n + 8\log M}{\log n + \log m^{\mathrm{kl}}}. \tag{32}$$


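2. We now move on to more stringent convergence by considering the regime where $\lim_{n\to\infty} \inf_{\psi} P_e(\psi) \lesssim 1/n$. Putting Theorem 1 (with $\delta = 3$) and Theorem 3(b) (with $\alpha = 1/2$ and $\epsilon \approx 1$) together implies that

$$\inf_{\psi} P_e(\psi) < \frac{1}{n} \quad \text{if } \mathrm{Hel}_{1/2}^{\min} \cdot p_{\mathrm{obs}} n > (1+o(1)) \left( 8\log n + 4\log M \right), \tag{33}$$

$$\inf_{\psi} P_e(\psi) > \frac{1}{n} \quad \text{if } \mathrm{Hel}_{1/2}^{\min} \cdot p_{\mathrm{obs}} n < (1-o(1)) \log n, \tag{34}$$

which holds regardless of $\frac{\mathrm{d}P_l}{\mathrm{d}P_j}$. These conditions do not impose constraints on the alphabet size, and are tight up to a multiplicative gap of

$$(1+o(1)) \left( 8 + \frac{4\log M}{\log n} \right). \tag{35}$$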


Remark 2. The multiplicative factor (32) is small either when the alphabet size is $M = O(\mathrm{poly}\log(n))$ (in which case this factor is 4) or when $M \asymp m^{\mathrm{kl}}$ (in which case this factor is at most 8). Similarly, the multiplicative factor (35) is the smallest when the alphabet size is $O(\mathrm{poly}\log(n))$. However, both results might become loose when $\log M$ becomes large relative to $\log n$ and $\log m^{\mathrm{kl}}$. For many practical applications, the alphabet size is typically much smaller than $n$, in which case our results are tight within a reasonable constant factor.

In summary, we have characterized the fundamental recovery condition under the Erdős-Rényi model, which reads

$$\mathrm{Hel}_{1/2}^{\min} \cdot d_{\mathrm{avg}} \gtrsim \log n \tag{36}$$

as long as the alphabet size $M$ is not super-polynomial$^{9}$ in $n$. Put another way, in order to allow exact recovery, the sample complexity, i.e. the total number of edges of $\mathcal{G}$ (which is around $n d_{\mathrm{avg}} / 2$), necessarily obeys

$$\text{minimum sample complexity} \asymp \frac{n \log n}{\mathrm{Hel}_{1/2}^{\min}}. \tag{37}$$


9 When the alphabet size is super-polynomial in n, our upper and lower bounds are within a factor of O ( ) from optimal. 

We note, however, that in all our motivating applications, the alphabet size M is typically much smaller than exp (0 (n)) and 
hence the regime with super-polynomial alphabet size is of little practical relevance. 












Interestingly, these simple characterizations as well as the underlying intuitions carry over to many more 
homogeneous graphs, as will be seen in the next section. 
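As a rough illustration of how (36) and (37) translate into a measurement budget, the sketch below evaluates both expressions with all constants dropped. The function names and the numerical values of $n$ and $\mathrm{Hel}_{1/2}^{\min}$ are hypothetical and chosen only for illustration.

```python
import math

def min_sample_complexity(n, hel_half_min):
    """Order-wise estimate n * log(n) / Hel_{1/2}^min from (37); constants are ignored."""
    if hel_half_min <= 0:
        raise ValueError("the divergence must be positive")
    return n * math.log(n) / hel_half_min

def edges_from_avg_degree(n, d_avg):
    """Number of edges in a graph with n vertices and average degree d_avg."""
    return n * d_avg / 2.0

n, hel = 10_000, 0.05            # hypothetical problem size and channel quality
d_needed = math.log(n) / hel     # average degree suggested by (36), constants dropped
print(min_sample_complexity(n, hel), edges_from_avg_degree(n, d_needed))
```

The two printed numbers differ by exactly a factor of 2, which reflects the $n d_{\mathrm{avg}} / 2$ edge count and is absorbed by the order-wise notation in (37).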

Before concluding this section, we remark on the possibility of improving the preconstant. There are a few cases where the tight preconstants have been settled, including the stochastic block model [28], [34] and the censor block model [17], [34], provided that the alphabet size (or the number of communities) is $M = 2$. Asymptotically, the necessary and sufficient recovery condition reads

$$\sup_{0<\alpha<1} \left\{ (1-\alpha)\, \mathrm{Hel}_{\alpha}^{\min} \right\} > \frac{\log n}{p_{\mathrm{obs}} n} \quad \text{or} \quad \text{minimum sample complexity} > \frac{n \log n}{2 \sup_{0<\alpha<1} \left\{ (1-\alpha)\, \mathrm{Hel}_{\alpha}^{\min} \right\}},$$

thus justifying the tightness of the sufficient recovery condition we derive. In fact, when the alphabet size is a constant, the fundamental divergence measure that dictates the information limits is often some variant of the minimum Chernoff information$^{10}$ or the Hellinger divergence (when optimized over $\alpha$). Nevertheless, it is not clear whether such findings extend to the large alphabet settings. Part of the reason is that the minimum Chernoff information or Hellinger divergence does not necessarily capture the precise error exponent when testing many hypotheses. We leave to future work the investigation of tight preconstants in the large alphabet scenarios.


4 Main Results: General Graphs 

We now broaden our scope by exploring general measurement graphs beyond the simple Erdos-Renyi model, 
with emphasis on the family of homogeneous graphs. 

4.1 Preliminaries: Key Graphical Metrics 

Our theory relies on several widely encountered graphical metrics including the minimum vertex degree, the average vertex degree, the maximum vertex degree, and the size of the minimum cut, which we denote by $d_{\min}$, $d_{\mathrm{avg}}$, $d_{\max}$, and $\mathrm{mincut}$, respectively. This subsection introduces a few other not-so-common graphical quantities that prove crucial in presenting our results.

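For any integer $m$, define

$$\mathcal{N}(m) := \left\{ \mathcal{S} \subset \mathcal{V} \mid e\left( \mathcal{S}, \mathcal{S}^{c} \right) \le m \right\}, \tag{38}$$

which comprises all cuts of size at most $m$. We are particularly interested in the peak growth rate of the cardinality of $\mathcal{N}$ as defined below:

$$\tau_k^{\mathrm{cut}} := \frac{1}{k} \log \left| \mathcal{N}(k \cdot \mathrm{mincut}) \right| \quad \text{and} \quad \tau^{\mathrm{cut}} := \max_{k > 0} \tau_k^{\mathrm{cut}}. \tag{39}$$

In the sequel, we will term $\tau^{\mathrm{cut}}$ the cut-homogeneity exponent. In fact, if we rewrite

$$\tau_k^{\mathrm{cut}} := \mathrm{mincut} \cdot \frac{1}{k \cdot \mathrm{mincut}} \log \left| \mathcal{N}(k \cdot \mathrm{mincut}) \right|, \tag{40}$$

then we see that $\tau^{\mathrm{cut}}$ relies on two factors: (i) the cut-set distribution exponents $\left\{ \frac{1}{k} \log |\mathcal{N}(k)| \right\}_{k>0}$ and (ii) the size of the minimum cut, both of which are important in capturing the degree of homogeneity of the cut-set distribution. This metric is best illustrated through the following two extreme examples: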

• Complete graph $K_n$ on $n$ vertices. This homogeneous graph obeys $e(\mathcal{S}, \mathcal{S}^{c}) = |\mathcal{S}| (n - |\mathcal{S}|)$ and $\mathrm{mincut} = n - 1$. A simple combinatorial argument suggests that $|\mathcal{N}(m)| \asymp \binom{n}{m/n} \asymp n^{m/n}$, revealing that

$$\tau^{\mathrm{cut}} = \max_{k} \frac{1}{k} \log \left| \mathcal{N}(k \cdot \mathrm{mincut}) \right| \asymp \max_{k} \frac{1}{k} \cdot \frac{k \cdot \mathrm{mincut}}{n} \log n \asymp \log n.$$

• Two complete subgraphs $K_{n/2}$ connected by a single bridge. In this graph, the min-cut size is $\mathrm{mincut} = 1$ due to the existence of a bridge, but we still have $|\mathcal{N}(m)| \lesssim n^{m/n}$ when $m \ge n$. A little algebra gives

$$\tau^{\mathrm{cut}} = \max_{k} \frac{1}{k} \log \left| \mathcal{N}(k \cdot \mathrm{mincut}) \right| \lesssim \max_{k} \frac{1}{k} \cdot \frac{k \cdot \mathrm{mincut}}{n} \log n \asymp \frac{\log n}{n}.$$

10 Note that the Chernoff information is defined to be $-\log \left\{ 1 - \sup_{0<\alpha<1} (1-\alpha)\, \mathrm{Hel}_{\alpha} \right\}$.


Interestingly, for various homogeneous graphs of interest, $\tau^{\mathrm{cut}}$ can be bounded above in a tight and simple manner, namely, $\tau^{\mathrm{cut}} \lesssim \log n$. This is formally stated in the following lemma, which accounts for homogeneous geometric graphs and expander graphs. In words, a graph is said to be a homogeneous geometric graph if it satisfies two properties: (i) each connected pair of vertices shares sufficiently many neighbors; (ii) when two vertices are geometrically close, they share a large fraction of neighbors. Here and throughout, we shall use $V(u)$ to denote the set of neighbors of a vertex $u$.

Lemma 1. (1) Homogeneous geometric graphs. Suppose that Q is connected and is embedded in some 
Euclidean space. Assume that there exist two numerical constants p > 0 and 0 < n < \ such that 

(a) for each (u,v) G £, 

|V (w)flV(r)| > p • mincut; (41) 


(b) for each (n, v ) G £, denoting by the i th closest vertex to v among the vertices in V (u) D V (v) , 
one has 


V (v) \ V(w^) 

Under the above two conditions, one has 


< —p • mincut, 1 < i < np - mincut. 


8 


r cut < — log (2 n). 
np 

(2) Expander graphs. If $\mathcal{G}$ is an expander graph with edge expansion $h_{\mathcal{G}}$, then

$$\tau^{\mathrm{cut}} \le \frac{\mathrm{mincut}}{h_{\mathcal{G}}} \log n + \log 2.$$

Proof. See Appendix E.


(42) 

(43) 

(44) 

□ 


We highlight a few concrete examples covered by this lemma.

• The following instances of homogeneous geometric graphs are worth mentioning. The first is a random geometric graph $\mathcal{G}_{n,r}$, provided that $r^2 > c \log n$ for some sufficiently large $c > 0$. The second is a generalized ring in which two vertices are connected as long as they are at most a few vertices apart. For both cases, $\kappa$ and $\rho$ are constants bounded away from zero, indicating that

$$\tau^{\mathrm{cut}} \lesssim \log n.$$

• Another situation concerns those expander graphs with good expansion properties, including but not limited to Erdős-Rényi graphs, random regular graphs, and small world graphs. Since the expansion properties of these graphs obey $h_{\mathcal{G}} / \mathrm{mincut} = \Theta(1)$, we conclude from Lemma 1 that

$$\tau^{\mathrm{cut}} \lesssim \log n.$$

As a final remark, we are not aware of a graph for which $\tau^{\mathrm{cut}}$ exceeds the order of $\log n$. In all aforementioned examples, one always has $\tau^{\mathrm{cut}} \lesssim \log n$. An in-depth study of the upper limit on $\tau^{\mathrm{cut}}$ might lead to further simplification of our results, which we leave for future work.
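For very small graphs, the cut-homogeneity exponent of (39) can be evaluated by exhaustive enumeration. The sketch below is illustrative only and rests on two assumptions not stated in the text: the empty set and the full vertex set are excluded from $\mathcal{N}(m)$, and $\mathcal{S}$ and $\mathcal{S}^{c}$ are counted as two distinct cuts. The function names are hypothetical, and the running time is exponential in $n$.

```python
import itertools
import math

def cut_size(edges, subset):
    """Number of edges with exactly one endpoint in `subset`."""
    return sum((u in subset) != (v in subset) for u, v in edges)

def cut_profile(n, edges):
    """Cut sizes of all nonempty proper subsets of {0, ..., n-1} (exponential time)."""
    sizes = []
    for r in range(1, n):
        for subset in itertools.combinations(range(n), r):
            sizes.append(cut_size(edges, set(subset)))
    return sizes

def tau_cut(n, edges, k_values):
    """Return (mincut, max over k of (1/k) * log |N(k * mincut)|) for the given k >= 1.

    The graph is assumed to be connected, so that mincut >= 1.
    """
    sizes = cut_profile(n, edges)
    mincut = min(sizes)
    best = -math.inf
    for k in k_values:
        count = sum(s <= k * mincut for s in sizes)   # |N(k * mincut)|
        best = max(best, math.log(count) / k)
    return mincut, best

# Complete graph K_6: every vertex has degree 5, so mincut = 5.
n = 6
complete = list(itertools.combinations(range(n), 2))
print(tau_cut(n, complete, k_values=[1, 2, 3]))
```

Only a finite grid of $k$ is searched, so the returned value is a lower estimate of the maximum in (39) rather than the exact exponent.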
4.2 ML Decoding and Minimax Lower Bounds

This section presents recovery conditions based on the minimum information divergence and certain graphical metrics, which accommodate general graph structures, channel characteristics, and input alphabets. We defer detailed discussion of our results to Section 4.3.

To begin with, the following theorem, whose proof can be found in Appendix D, characterizes a regime where the ML decoder is guaranteed to work.


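Theorem 4. Consider any connected graph $\mathcal{G}$. For any $\delta > 0$ and any $0 < \alpha < 1$, the ML rule $\psi_{\mathrm{ml}}$ achieves

$$P_e(\psi_{\mathrm{ml}}) \le \frac{1}{(2n)^{\delta - 1}}$$

provided that

$$\sup_{\alpha} \left\{ (1-\alpha)\, \mathrm{Hel}_{\alpha}^{\min} \right\} \cdot \mathrm{mincut} > 8 \tau^{\mathrm{cut}} + (6 + \delta) \log(2n) + 4 \log M. \tag{45}$$

Remark 3. Theorem 4 continues to hold if $\mathrm{Hel}_{\alpha}^{\min}$ is replaced by $D_{\alpha}^{\min}$.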

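The sufficient recovery condition given in Theorem 4 is universal and holds for all graphs, and depends only on the min-cut size and the cut-homogeneity exponent irrespective of other graphical metrics. Similarly, the above sufficient condition extends to the scenario with location-dependent output distributions, as stated below.

Theorem 5. The recovery condition of Theorem 4 continues to hold under the transition probabilities (26) if $\mathrm{Hel}_{\alpha}^{\min}$ and $D_{\alpha}^{\min}$ are replaced by $\widetilde{\mathrm{Hel}}_{\alpha}^{\min}$ and $\widetilde{D}_{\alpha}^{\min}$, respectively, where $\widetilde{\mathrm{Hel}}_{\alpha}^{\min}$ is the modified metric defined in (27) and $\widetilde{D}_{\alpha}^{\min} := -\log \left( 1 - (1-\alpha)\, \widetilde{\mathrm{Hel}}_{\alpha}^{\min} \right)$.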
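The sufficient condition (45) is straightforward to evaluate once the graphical metrics are known. The sketch below is a hedged illustration: it assumes that the right-hand side of (45) reads $8\tau^{\mathrm{cut}} + (6+\delta)\log(2n) + 4\log M$ and that the error bound is $(2n)^{-(\delta-1)}$, and it replaces the supremum over $\alpha$ by a maximum over a finite, user-supplied grid. All function names and numerical inputs are hypothetical.

```python
import math

def theorem4_rhs(tau_cut, n, M, delta):
    """Assumed right-hand side of (45): 8*tau_cut + (6 + delta)*log(2n) + 4*log(M)."""
    return 8.0 * tau_cut + (6.0 + delta) * math.log(2 * n) + 4.0 * math.log(M)

def ml_sufficient(hel_by_alpha, mincut, tau_cut, n, M, delta):
    """Check (45), using a finite grid of alpha values as a stand-in for the supremum.

    `hel_by_alpha` maps each alpha in (0, 1) to the corresponding minimum divergence.
    """
    lhs = max((1.0 - a) * h for a, h in hel_by_alpha.items()) * mincut
    return lhs > theorem4_rhs(tau_cut, n, M, delta)

def error_bound(n, delta):
    """Assumed guarantee 1 / (2n)^(delta - 1) on the ML error probability."""
    return (2 * n) ** (-(delta - 1.0))

ok = ml_sufficient({0.5: 0.8}, mincut=400, tau_cut=math.log(1000), n=1000, M=4, delta=3)
print(ok, error_bound(1000, 3))
```

Because only a grid of $\alpha$ values is used, a `False` result does not rule out that the condition holds for some other $\alpha$.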

Next, we present a fundamental lower limit on $\mathrm{KL}^{\min}$ that admits perfect information recovery, based on the same graphical metrics in addition to the maximum vertex degree.

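Theorem 6 (KL Version). Fix $\zeta > 0$ and $0 < \epsilon < 1/2$. For any graph $\mathcal{G}$, if the KL divergence satisfies

$$\mathrm{KL}^{\min} \cdot \mathrm{mincut} < \max \left\{ (1-\epsilon)\, \tau^{\mathrm{cut}} - H(\epsilon), \ \frac{(1-\epsilon) \log m^{\mathrm{kl}}(\zeta) - H(\epsilon)}{1 + \zeta} \right\} \tag{46}$$

or

$$\mathrm{KL}^{\min} \cdot d_{\max} < \frac{(1-\epsilon) \left( \log n + \log m^{\mathrm{kl}}(\zeta) \right) - H(\epsilon)}{1 + \zeta}, \tag{47}$$

then the minimax probability of error obeys $\inf_{\psi} P_e(\psi) > \epsilon$.

Proof. See Appendix B. □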


Notably, the conditions (46) and (47) do not imply each other. The first condition (46), which characterizes the effects of cut-set distributions and alphabet size, is dominant for inhomogeneous graphs where $\mathrm{mincut} \ll d_{\max}$ (e.g. the graph formed by connecting two $K_{n/2}$ with a single bridge as described in Section 4.1). In comparison, the other condition (47) becomes tighter as $\frac{\mathrm{mincut}}{d_{\max}}$ grows, which is particularly useful when accounting for the family of homogeneous graphs where $d_{\max} \asymp \mathrm{mincut}$.

Finally, we complement the above KL version by another lower bound developed directly based on the Hellinger divergence, although it becomes loose for those inhomogeneous graphs obeying $\mathrm{mincut} \ll d_{\max}$. This is particularly useful when investigating the scenario that demands high-probability recovery (e.g. with success probability at least $1 - n^{-1}$). The proof can be found in Appendix C.

Theorem 7 (Hellinger Version). Consider any graph Q, any e > 0, and a < Suppose that d } 
2ecdog n. If 

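$$(1-\alpha)\, \mathrm{Hel}_{\alpha}^{\min} \cdot d_{\max} < \epsilon\alpha \log n - r_{\epsilon} \tag{48}$$

for some residual$^{11}$ $r_{\epsilon}$, then $\inf_{\psi} P_e(\psi) > n^{-\epsilon}$.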


Remark 4. For any fixed e > 0, one can optimize (48) over all 0 < a < to derive a tighter condition 


11 More precisely, r e := log 2 + 2[ealogn log 2 ] ^ 
4.3 Interpretation and Discussion

We now discuss the messages conveyed by the aforementioned results, for which we emphasize a broad family of homogeneous graphs before turning to the most general graphs. In what follows, our discussion assumes $\frac{\mathrm{d}P_l}{\mathrm{d}P_j} = \Theta(1)$ for all $0 \le l, j < M$, in which case one has (by invoking Fact 1)

$$\mathrm{KL}^{\min} \asymp \mathrm{Hel}_{1/2}^{\min}. \tag{49}$$

Operating upon such assumptions enables us to significantly simplify the presentation, while still capturing the regime that is statistically the most challenging (compared to its complement regime where $\frac{\mathrm{d}P_l}{\mathrm{d}P_j} \gg 1$).

4.3.1 Homogeneous Graphs 

Our recovery conditions are most useful when applied to homogeneous graphs. Formally speaking, we term $\mathcal{G}$ a homogeneous graph if it satisfies

$$\mathrm{mincut} \asymp d_{\mathrm{avg}} \asymp d_{\max}, \tag{50}$$

which subsumes as special cases the widely adopted Erdős-Rényi graphs, random geometric graphs, small world graphs, rings, grids, and many other expander graphs. A few implications are in order.

1. For all homogeneous graphs, one has

$$\inf_{\psi} P_e(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } \mathrm{Hel}_{1/2}^{\min} \cdot d_{\mathrm{avg}} \gtrsim \tau^{\mathrm{cut}} + \log n + \log M,$$

$$\inf_{\psi} P_e(\psi) \nrightarrow 0 \quad \text{if } \mathrm{Hel}_{1/2}^{\min} \cdot d_{\mathrm{avg}} \lesssim \tau^{\mathrm{cut}} + \log n + \log m^{\mathrm{kl}}.$$

In general, these results are within a multiplicative gap

$$\mathrm{gap} \lesssim \frac{\tau^{\mathrm{cut}} + \log n + \log M}{\tau^{\mathrm{cut}} + \log n + \log m^{\mathrm{kl}}}$$

from optimal, and are orderwise tight when either $M \lesssim \mathrm{poly}(n)$ or $\log m^{\mathrm{kl}} \asymp \log M$. In particular, as long as the alphabet size is not super-polynomial in $n$, we arrive at the fundamental recovery condition for this class of graphs:

$$\mathrm{Hel}_{1/2}^{\min} \cdot d_{\mathrm{avg}} \gtrsim \log n + \tau^{\mathrm{cut}}. \tag{51}$$

2. In comparison to the recovery guarantee developed for the Erdős-Rényi model, the condition (51) includes one extra correction term $\tau^{\mathrm{cut}}$ concerning the cut-set distribution. To provide some intuition about $\tau^{\mathrm{cut}}$, suppose that the ground truth is $x = 0$ and consider an alternative hypothesis $\tilde{x}$ whose non-zero entries are all identical. If we denote by $\mathcal{S}$ the vertex set corresponding to the support of $\tilde{x}$, then it is straightforward to see that all measurements that can help distinguish $x$ and $\tilde{x}$ reside in the cut set $\mathcal{E}(\mathcal{S}, \mathcal{S}^{c})$. By definition, $\tau_k^{\mathrm{cut}}$ determines the total number of distinct cuts whose size is within some fixed range. Since $\tau_k^{\mathrm{cut}}$ is defined in a logarithmic and normalized manner, this in turn specifies how many bits are needed to represent all these cuts and, hence, all hypotheses associated with them. As a consequence, $\tau^{\mathrm{cut}}$ presents another information-theoretic requirement.


3. While our results fall short of a general upper bound on $\tau^{\mathrm{cut}}$, we note that $\tau^{\mathrm{cut}} \lesssim \log n$ holds for a broad class of interesting models studied in the literature (and in fact all models that we are aware of), including but not limited to various homogeneous geometric graphs and expander graphs (cf. Lemma 1). As a consequence, the recovery condition (51) for these graphs further simplifies to

$$\mathrm{Hel}_{1/2}^{\min} \cdot d_{\mathrm{avg}} \gtrsim \log n, \tag{52}$$

which coincides with the one under the special Erdős-Rényi model. Following the intuition given in Section 3.1, one must rely on around $d_{\mathrm{avg}}$ measurements to distinguish two minimally separated hypotheses, i.e. those that differ by a single component, and hence the information bottleneck constitutes around $\mathrm{Hel}_{1/2}^{\min} \cdot d_{\mathrm{avg}}$ bits, which needs to be at least $\log n$ bits in order to encode $n$ minimally apart hypotheses.
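Measurement graphs                               Fundamental recovery conditions
-----------------------------------------------  ---------------------------------------------------------------
Erdős-Rényi graphs                               $\mathrm{Hel}_{1/2}^{\min} \cdot d_{\mathrm{avg}} \gtrsim \log n$
homogeneous geometric graphs, expander graphs    $\mathrm{Hel}_{1/2}^{\min} \cdot d_{\mathrm{avg}} \gtrsim \log n$
homogeneous graphs                               $\mathrm{Hel}_{1/2}^{\min} \cdot d_{\mathrm{avg}} \gtrsim \log n + \tau^{\mathrm{cut}}$
general graphs                                   $\mathrm{Hel}_{1/2}^{\min} \cdot \mathrm{mincut} \gtrsim g(n) + \tau^{\mathrm{cut}}$ ($1 \lesssim g(n) \lesssim \log n$)

Table 1: Summary of key results for all graph models ($M \lesssim \mathrm{poly}(n)$).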


4. The condition (52) in turn leads to an interesting observation: for a variety of homogeneous graphs, the information-theoretic limits for graph-based decoding are determined solely by the edge sparsity, as opposed to the performance guarantees for many tractable algorithms (e.g. spectral methods or semidefinite programming) whose success typically relies on strong second-order expansion properties.

Finally, by combining Theorem 4 and Theorem 7 (with $\epsilon = 1$), we arrive at the following criterion concerning "high-probability" recovery: for various homogeneous graphs that obey $\tau^{\mathrm{cut}} \lesssim \log n$, the probability of error $P_e(\psi) \le n^{-1}$ is possible if and only if

$$\mathrm{Hel}_{1/2}^{\min} \cdot d_{\mathrm{avg}} \gtrsim \log n. \tag{53}$$

In contrast to the preceding discussion, this statement holds regardless of how $\frac{\mathrm{d}P_l}{\mathrm{d}P_j}$ scales.


4.3.2 General Graphs 

We now move on to discussing the results in their full generality. One distinguishing feature from the family of homogeneous graphs is that the recovery boundary is dictated by the size of the minimum cut rather than the graph edge sparsity. For the convenience of the reader, we summarize all key results in Table 1.

1. Tightness under general graphs. The recovery conditions presented in Theorems 4 and 6 can be summarized as follows:

$$\inf_{\psi} P_e(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } \mathrm{Hel}_{1/2}^{\min} \cdot \mathrm{mincut} \gtrsim \tau^{\mathrm{cut}} + \log M + \log n,$$

$$\inf_{\psi} P_e(\psi) \nrightarrow 0 \quad \text{if } \mathrm{Hel}_{1/2}^{\min} \cdot \mathrm{mincut} \lesssim \tau^{\mathrm{cut}} + \log m^{\mathrm{kl}} + \frac{\mathrm{mincut}}{d_{\max}} \log n.$$

These are within a multiplicative gap from optimal, satisfying

$$\mathrm{gap} \lesssim \frac{\tau^{\mathrm{cut}} + \log M + \log n}{\tau^{\mathrm{cut}} + \log m^{\mathrm{kl}} + \frac{\mathrm{mincut}}{d_{\max}} \log n}.$$

Recognizing that $\tau^{\mathrm{cut}} \gtrsim 1$, we see that the derived bounds are orderwise optimal when $\log m^{\mathrm{kl}} \asymp \log M \asymp \log n$ (e.g. in the outlier model presented in Section 5.2 under large alphabet). Even for the loosest case, the gap is at most logarithmic (i.e. $O(\log n + \log M)$).


2. Information bottleneck. In contrast to (51) and (52), the amount of information one has available to differentiate two minimally separated hypotheses is approximately given by $\mathrm{Hel}_{1/2}^{\min} \cdot \mathrm{mincut}$ instead of $\mathrm{Hel}_{1/2}^{\min} \cdot d_{\mathrm{avg}}$. This makes sense since the two hypotheses that are most difficult to differentiate are no longer those that differ by one component. Instead, the most challenging task lies in linking the variables across the minimum cut, which can convey at most $\mathrm{Hel}_{1/2}^{\min} \cdot \mathrm{mincut}$ bits of information, forming the most fragile component for simultaneous recovery.


3. A unified non-asymptotic framework. Our framework can accommodate a variety of practical scenarios that respect the high-dimensional regime: the alphabet size might be growing with $n$ while the channel divergence metrics might be decaying. Furthermore, our problem falls under the category of multi-hypothesis testing in the presence of exponentially many hypotheses, where each hypothesis is not necessarily formed by i.i.d. sequences. Under such a setting, the conventional Sanov bound [22] based on the Chernoff information measure [24] becomes unwieldy. In contrast, our results build upon alternative probability divergence measures (particularly the Hellinger / Rényi divergence). This results in a simple unified framework that enables non-asymptotic characterization of the minimax limits (modulo some constant factor) simultaneously for most settings.

In general, the current approach is unable to close the worst-case gap $O(\log n + \log M)$, which could be large when either $n$ or $M$ is exceedingly large. In order to improve the recovery conditions, one alternative is to derive a tighter lower bound on the graphical metric $\tau^{\mathrm{cut}}$. For instance, our bounds become orderwise tight whenever $\tau^{\mathrm{cut}} \gtrsim \log n$, which arises in various graphs beyond the family of homogeneous graphs. We leave this for future investigation. In addition, our general lower bounds are developed based on Fano's inequality, since Fano's inequality allows us to accommodate a set of input hypotheses that have significant overlaps. Unfortunately, Fano's inequality typically relies on the KL divergence between input hypotheses, which is in general not capable of capturing the right error exponent for hypothesis testing. It would be interesting to develop a variant of the Fano-type inequality based directly on the Chernoff information measures.


5 Consequences for Specific Applications 

In this section, we apply our general theory to a few concrete examples that have been studied in prior 
literature. As will be seen, our general theorems lead to order-wise tight characterization for all these 
canonical examples. 

5.1 Stochastic Block Model 

We start by analyzing the stochastic block model (SBM), which is a generative way to model community structure. In the standard SBM, nodes are partitioned into two disjoint clusters (so one can assign labels $x_i \in \{0, 1\}$ for each node). Each pair of nodes is connected with probability $\frac{\alpha \log n}{n}$ or $\frac{\beta \log n}{n}$ depending on whether they fall within the same cluster or not. The goal is to infer the underlying clusters that produce the network. Of particular interest is exact recovery of the entire clusters, which has received considerable
attention; see 1 1,6,16, 19,28|30,[M||43| [46}|48||M] for a highly incomplete list of references. 

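We focus on the regime where $\alpha, \beta = o(n / \log n)$ and $\alpha > \beta$, which subsumes all but the densest community structures. Treating the SBM as a graphical channel over a complete measurement graph (i.e. $p_{\mathrm{obs}} = 1$) with outputs being either 0 or 1, which encode whether two nodes are connected or not, we see that (cf. Definition 13)

$$P_0 = \mathrm{Bern}\left( \frac{\alpha \log n}{n} \right) \quad \text{and} \quad P_1 = \mathrm{Bern}\left( \frac{\beta \log n}{n} \right).$$

This allows us to compute

$$\mathrm{Hel}_{1/2}^{\min} = \left( \sqrt{\frac{\alpha \log n}{n}} - \sqrt{\frac{\beta \log n}{n}} \right)^2 + \left( \sqrt{1 - \frac{\alpha \log n}{n}} - \sqrt{1 - \frac{\beta \log n}{n}} \right)^2 = (1 + o(1)) \left( \sqrt{\alpha} - \sqrt{\beta} \right)^2 \frac{\log n}{n}.$$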


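In addition, the relation between the KL divergence and the $\chi^2$ divergence (e.g. [58, Equation (7)]) suggests that

$$\mathrm{KL}^{\min} \le \mathrm{KL}\left( \frac{\beta \log n}{n} \,\Big\|\, \frac{\alpha \log n}{n} \right) \le \chi^2\left( \frac{\beta \log n}{n} \,\Big\|\, \frac{\alpha \log n}{n} \right) \overset{(a)}{=} \frac{\left( \frac{\beta \log n}{n} - \frac{\alpha \log n}{n} \right)^2}{\frac{\alpha \log n}{n} \left( 1 - \frac{\alpha \log n}{n} \right)} = (1 + o(1)) \, \frac{(\alpha - \beta)^2}{\alpha} \cdot \frac{\log n}{n}, \tag{54}$$

where (a) follows from the identity $\chi^2(p \,\|\, q) = \frac{(p - q)^2}{q (1 - q)}$ for two Bernoulli distributions with means $p$ and $q$. With these two estimates in place, Theorem 1 and Corollary 3 immediately give

$$\inf_{\psi} P_e(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } \left( \sqrt{\alpha} - \sqrt{\beta} \right)^2 > 2 (1 + o(1)), \tag{55}$$

$$\inf_{\psi} P_e(\psi) \nrightarrow 0 \quad \text{if } (\alpha - \beta)^2 < (1 - o(1))\, \alpha. \tag{56}$$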


In fact, the precise phase transition for exact cluster recovery was determined only last year [28, 48]. These results assert that

$$\inf_{\psi} P_e(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } \left( \sqrt{\alpha} - \sqrt{\beta} \right)^2 > 2, \tag{57}$$

$$\inf_{\psi} P_e(\psi) \nrightarrow 0 \quad \text{if } \left( \sqrt{\alpha} - \sqrt{\beta} \right)^2 < 2, \tag{58}$$

justifying that the sufficient condition we develop is precise. When it comes to the necessary condition, one can verify that the condition (58) is more stringent than$^{12}$ $(\alpha - \beta)^2 < 4 (\alpha + \beta)$. In comparison, the boundary of our condition (56) is sandwiched between the curves $(\alpha - \beta)^2 < \frac{1}{2} (\alpha + \beta)$ and $(\alpha - \beta)^2 < \alpha + \beta$. These taken collectively indicate that our theory is tight up to a small constant factor.
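The SBM calculations above are easy to reproduce numerically. The sketch below is illustrative only: it takes the squared Hellinger distance to be $\sum_y (\sqrt{P(y)} - \sqrt{Q(y)})^2$, compares it with the leading-order expression $(\sqrt{\alpha} - \sqrt{\beta})^2 \log n / n$, and evaluates the conditions (57) and (56) with the $o(1)$ terms dropped. The function names and the sample values of $(n, \alpha, \beta)$ are hypothetical.

```python
import math

def hel_half_bernoulli(p, q):
    """Squared Hellinger distance sum_y (sqrt(P(y)) - sqrt(Q(y)))^2 for Bernoulli laws."""
    return (math.sqrt(p) - math.sqrt(q)) ** 2 + (math.sqrt(1 - p) - math.sqrt(1 - q)) ** 2

def sbm_sufficient(alpha, beta):
    """Sufficient condition (57): (sqrt(alpha) - sqrt(beta))^2 > 2."""
    return (math.sqrt(alpha) - math.sqrt(beta)) ** 2 > 2.0

def sbm_necessary_fails(alpha, beta):
    """Condition (56) with the o(1) term dropped: recovery fails if (alpha - beta)^2 < alpha."""
    return (alpha - beta) ** 2 < alpha

n, alpha, beta = 10**6, 9.0, 1.0
scale = math.log(n) / n
exact = hel_half_bernoulli(alpha * scale, beta * scale)
approx = (math.sqrt(alpha) - math.sqrt(beta)) ** 2 * scale
print(exact / approx, sbm_sufficient(alpha, beta), sbm_necessary_fails(alpha, beta))
```

The printed ratio is close to 1 for large $n$, in line with the $(1 + o(1))$ factor in the Hellinger computation.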

Several remarks are in order. To begin with, our results accommodate all values of $\alpha, \beta$ up to $o(n / \log n)$, which is broader than [28], which concentrates on the sparsest possible regime (i.e. $\alpha, \beta \asymp 1$). Leaving out this technical matter, a more interesting observation is that the achievability bound we develop for the ML rule matches the fundamental recovery limit in a precise manner, which seems to imply that the squared Hellinger distance is the right metric that dictates the recovery limits for the SBMs.

When finishing up this paper, we became aware of a very recent work [5] that characterizes the fundamental limits for the generalized SBM, that is, the model where $n$ nodes are partitioned into multiple clusters. Extending our framework so as to accommodate the SBM in its full generality is a topic of future work.


5.2 Outlier Model 


We now turn to another model called the outlier model, which subsumes as special cases several applications including alignment, synchronization, and joint matching (e.g. [14,36,59]).

Suppose that the measurements $y_{ij}$ are independently corrupted following a distribution

$$y_{ij} = \begin{cases} x_i - x_j, & \text{with probability } p_{\mathrm{true}}, \\ \mathrm{Unif}_M, & \text{else}, \end{cases} \tag{59}$$

where $\mathrm{Unif}_M$ is the uniform distribution over $\{0, \cdots, M-1\}$, $p_{\mathrm{true}}$ stands for the non-corruption rate, and "$-$" is some general subtraction operation defined in Section 2. In words, a fraction $1 - p_{\mathrm{true}}$ of measurements act as random outliers and contain no useful information. Note that under this outlier model, one has

$$m^{\mathrm{kl}}(\epsilon) = M - 1, \qquad \forall \epsilon > 0.$$
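The measurement process (59) can be simulated in a few lines. The sketch below is a hedged illustration: it assumes that the general subtraction operation is subtraction modulo $M$ (the actual operation is the one defined in Section 2), and the function names, the ground truth, and the edge list are hypothetical.

```python
import random

def observe(x_i, x_j, M, p_true, rng):
    """One measurement from (59): the true difference w.p. p_true, else a uniform outlier.

    Assumption: the subtraction is taken modulo M.
    """
    if rng.random() < p_true:
        return (x_i - x_j) % M
    return rng.randrange(M)

def simulate(x, edges, M, p_true, seed=0):
    """Draw y_ij independently for every edge (i, j) of the measurement graph."""
    rng = random.Random(seed)
    return {(i, j): observe(x[i], x[j], M, p_true, rng) for i, j in edges}

x = [0, 3, 1, 4]                            # hypothetical ground truth over {0, ..., 4}
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]    # hypothetical measurement graph
print(simulate(x, edges, M=5, p_true=0.7))
```

Note that an outlier may coincide with the true difference by chance, so an edge carries the correct value with probability $p_{\mathrm{true}} + (1 - p_{\mathrm{true}})/M$.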

The following corollary, an immediate consequence of Theorem 1 and Corollary 3, presents concrete recovery limits for the outlier model. For ease of presentation, we restrict our discussion to the Erdős-Rényi model, but remark that all results extend to homogeneous geometric graphs and other expander graphs (up to some constant factors) if one replaces $p_{\mathrm{obs}} n$ with the average vertex degree.


12 To see this, observe that $(\sqrt{\alpha} - \sqrt{\beta})^2 < 2$ is identical to $(\alpha - \beta)^2 < 2 (\sqrt{\alpha} + \sqrt{\beta})^2$, which is more stringent than $(\alpha - \beta)^2 < 4 (\alpha + \beta)$ due to the elementary inequality $(a + b)^2 \le 2 (a^2 + b^2)$.
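Corollary 1. Suppose that $p_{\mathrm{obs}} > \frac{c_1 \log n}{n}$ for some sufficiently large $c_1 > 0$. Then, one has

$$\inf_{\psi} P_e(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } \left( \sqrt{1 - p_{\mathrm{true}} + M p_{\mathrm{true}}} - \sqrt{1 - p_{\mathrm{true}}} \right)^2 > (1 + \epsilon)\, M \left( \frac{\log n + 2 \log M}{p_{\mathrm{obs}} n} \right), \tag{60}$$

$$\inf_{\psi} P_e(\psi) \nrightarrow 0 \quad \text{if } p_{\mathrm{true}} < \max \left\{ \frac{(1 - \epsilon) (\log n + \log M)}{p_{\mathrm{obs}} n \log \left( 1 + \frac{p_{\mathrm{true}} M}{1 - p_{\mathrm{true}}} \right)}, \ \frac{M}{M - 1} \left( \frac{\log n}{p_{\mathrm{obs}} n} - \frac{1}{M} \right) \right\}. \tag{61}$$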


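To establish this corollary, we start by considering the graph $\mathcal{G}_{\mathrm{true}}$ that comprises all edges where $y_{ij} = x_i - x_j$. It is self-evident that $\mathcal{G}_{\mathrm{true}} \sim \mathcal{G}_{n, \left( p_{\mathrm{true}} + \frac{1 - p_{\mathrm{true}}}{M} \right) p_{\mathrm{obs}}}$, and thus $\left( \left( 1 - \frac{1}{M} \right) p_{\mathrm{true}} + \frac{1}{M} \right) p_{\mathrm{obs}} > \frac{\log n}{n}$ is necessary to ensure connectivity (otherwise there will be no basis to link the node variables across disconnected components). Apart from this, everything boils down to calculating $\mathrm{KL}^{\min}$ and $\mathrm{Hel}_{1/2}^{\min}$, which we gather in the following lemma.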


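Lemma 2. Consider the outlier model (59). For any $0 < p_{\mathrm{true}} < 1$, one has

$$\mathrm{KL}^{\min} = p_{\mathrm{true}} \log \left( 1 + \frac{p_{\mathrm{true}} M}{1 - p_{\mathrm{true}}} \right) \quad \text{and} \quad \mathrm{Hel}_{1/2}^{\min} = \frac{2}{M} \left( \sqrt{1 - p_{\mathrm{true}} + M p_{\mathrm{true}}} - \sqrt{1 - p_{\mathrm{true}}} \right)^2. \tag{62}$$

More simply, these metrics can be bounded as

$$\mathrm{KL}^{\min} \le \frac{p_{\mathrm{true}}^2 M}{1 - p_{\mathrm{true}}} \quad \text{and} \quad \mathrm{Hel}_{1/2}^{\min} \ge \frac{p_{\mathrm{true}}^2 M}{2 \left( 1 - p_{\mathrm{true}} + M p_{\mathrm{true}} \right)}. \tag{63}$$

Proof. See Appendix F. □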
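The closed forms in Lemma 2 can be sanity-checked by computing the two divergences directly from the output distributions. The sketch below is illustrative and rests on stated assumptions: the output law for a difference $l$ places mass $p_{\mathrm{true}} + (1 - p_{\mathrm{true}})/M$ on $l$ and $(1 - p_{\mathrm{true}})/M$ elsewhere, the squared Hellinger distance is $\sum_y (\sqrt{P(y)} - \sqrt{Q(y)})^2$, and the closed forms are read as $p_{\mathrm{true}} \log(1 + p_{\mathrm{true}} M / (1 - p_{\mathrm{true}}))$ and $\frac{2}{M}(\sqrt{1 - p_{\mathrm{true}} + M p_{\mathrm{true}}} - \sqrt{1 - p_{\mathrm{true}}})^2$. Function names and parameter values are hypothetical.

```python
import math

def outlier_law(shift, M, p_true):
    """Distribution of y_ij when x_i - x_j = shift under the outlier model (59)."""
    law = [(1.0 - p_true) / M] * M
    law[shift % M] += p_true
    return law

def kl(p, q):
    """KL divergence between two discrete distributions with common support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hel_half(p, q):
    """Squared Hellinger distance sum_y (sqrt(p(y)) - sqrt(q(y)))^2."""
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def lemma2_closed_forms(M, p_true):
    """Closed forms for KL^min and Hel_{1/2}^min as read from Lemma 2."""
    kl_min = p_true * math.log(1.0 + p_true * M / (1.0 - p_true))
    root_gap = math.sqrt(1.0 - p_true + M * p_true) - math.sqrt(1.0 - p_true)
    return kl_min, 2.0 / M * root_gap ** 2

M, p_true = 8, 0.3
p0, p1 = outlier_law(0, M, p_true), outlier_law(1, M, p_true)
print(kl(p0, p1), hel_half(p0, p1), lemma2_closed_forms(M, p_true))
```

By symmetry of the uniform outlier distribution, every pair of distinct differences yields the same two values, so the pair $(0, 1)$ suffices to evaluate the minimum.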

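To illustrate these guarantees numerically, we depict in Fig. 4 an example of the preceding recovery conditions. In the sequel, we will discuss the tightness and implications of the above result for specific regimes, ranging from small alphabet to large alphabet. For convenience of theoretical comparison, we supply an alternative form obtained by applying the general theory but using the bounds (63):

$$\inf_{\psi} P_e(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } p_{\mathrm{true}} > 2 (1 + \epsilon) \sqrt{\frac{\left( 1 - p_{\mathrm{true}} + M p_{\mathrm{true}} \right) \left( \log n + 2 \log M \right)}{p_{\mathrm{obs}} n M}}, \tag{64}$$

$$\inf_{\psi} P_e(\psi) \nrightarrow 0 \quad \text{if } p_{\mathrm{true}} < (1 - \epsilon) \max \left\{ \sqrt{\frac{\left( 1 - p_{\mathrm{true}} \right) \left( \log n + \log M \right)}{p_{\mathrm{obs}} n M}}, \ \frac{\log n}{p_{\mathrm{obs}} n} \right\}. \tag{65}$$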


5.2.1 Tightness under Binary Alphabet 

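We start with the case where $M = 2$, which was also studied by [2]. When $p_{\mathrm{obs}} \gg \frac{\log n}{n}$, our results (64) and (65) assert that

$$\inf_{\psi} P_e(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } p_{\mathrm{true}} > (1 + o(1)) \sqrt{\frac{2 \log n}{p_{\mathrm{obs}} n}}, \tag{66}$$

$$\inf_{\psi} P_e(\psi) \nrightarrow 0 \quad \text{if } p_{\mathrm{true}} < (1 - o(1)) \sqrt{\frac{\log n}{2 p_{\mathrm{obs}} n}}. \tag{67}$$

As a result, our bounds are within a factor $2 + o(1)$ from optimal, which holds for all possible values of $(p_{\mathrm{obs}}, p_{\mathrm{true}})$. This constant gap is illustrated in Fig. 4(a) as well.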
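The factor-2 gap in the binary case can be observed numerically. The sketch below is illustrative only: it assumes that (64) and (65) have the square-root forms $2\sqrt{(1 - p_{\mathrm{true}} + M p_{\mathrm{true}})(\log n + 2\log M) / (p_{\mathrm{obs}} n M)}$ and $\max\{\sqrt{(1 - p_{\mathrm{true}})(\log n + \log M) / (p_{\mathrm{obs}} n M)}, \log n / (p_{\mathrm{obs}} n)\}$, sets $\epsilon = 0$, and evaluates both right-hand sides at a fixed small $p_{\mathrm{true}}$. The function names and the parameter values are hypothetical.

```python
import math

def sufficient_threshold(n, p_obs, M, p_true):
    """Assumed right-hand side of (64) with epsilon set to 0."""
    num = (1.0 - p_true + M * p_true) * (math.log(n) + 2.0 * math.log(M))
    return 2.0 * math.sqrt(num / (p_obs * n * M))

def necessary_threshold(n, p_obs, M, p_true):
    """Assumed right-hand side of (65) with epsilon set to 0."""
    info = math.sqrt((1.0 - p_true) * (math.log(n) + math.log(M)) / (p_obs * n * M))
    connectivity = math.log(n) / (p_obs * n)
    return max(info, connectivity)

n, p_obs, M, p_true = 10**5, 0.05, 2, 0.01
s = sufficient_threshold(n, p_obs, M, p_true)
t = necessary_threshold(n, p_obs, M, p_true)
print(s, t, s / t)
```

For finite $n$ the ratio sits slightly above 2 because of the $\log M$ terms and the factors involving $p_{\mathrm{true}}$, which vanish only asymptotically.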
Figure 4: The sufficient and the necessary conditions given in Corollary 1 when $n = 10^5$. The results are shown for: (a) $M = 2$; (b) $p_{\mathrm{obs}} = 0.05$.


In contrast, the bounds presented in [2] fall short of a uniform constant factor gap accommodating different parameter configurations. Adopting our notation, [2, Theorems 4.1 - 4.2] reduce to$^{13}$

$$\inf_{\psi} P_e(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } p_{\mathrm{true}} > \sqrt{\frac{2 \log n}{p_{\mathrm{obs}} n}},$$

$$\inf_{\psi} P_e(\psi) \nrightarrow 0 \quad \text{if } p_{\mathrm{true}} < \sqrt{\frac{2 (1 - 3\tau/2) \log n}{p_{\mathrm{obs}} n}},$$

where $0 < \tau < \frac{2}{3}$ is some numerical value so that $p_{\mathrm{obs}} \le 2 n^{\tau - 1}$. Hence, their bounds are tight up to a factor

$$g(\tau) = \frac{1}{\sqrt{1 - 3\tau/2}},$$

which approaches 1 in the sparse graph regime as $\tau \to 0$ (e.g. $p_{\mathrm{obs}} \asymp \frac{\log n}{n}$). On the other hand, it does not deliver meaningful conditions for the case where $\tau \ge \frac{2}{3}$ (i.e. $2 n^{-1/3} < p_{\mathrm{obs}} \le 1$). In comparison, our bounds are looser for sparse graphs ($\tau < \frac{1}{2}$ or $p_{\mathrm{obs}} < \frac{2}{\sqrt{n}}$) where $g(\tau) < 2$, but tighter for dense graphs ($\tau > \frac{1}{2}$ or $p_{\mathrm{obs}} > \frac{2}{\sqrt{n}}$) where $g(\tau) > 2$.

Notably, when $p_{\mathrm{obs}} \asymp \frac{\log n}{n}$, the fundamental limit approaches $\sqrt{\frac{2 \log n}{p_{\mathrm{obs}} n}}$ in an accurate manner [2]. This again corroborates the tightness of our achievability bound, implying that the squared Hellinger distance is the right quantity to control in the sparsest possible regime.


5.2.2 From Small Alphabet to Large Alphabet 

The recovery conditions given in Corollary 1 can be further divided into and simplified for two respective regimes, depending on whether $M p_{\mathrm{true}} < 1$ or $M p_{\mathrm{true}} > 1$. By substituting each of these two hypotheses into (64), deriving the corresponding minimum $p_{\mathrm{true}}$ for the respective case, and then checking the compatibility of $p_{\mathrm{true}} M$ with the hypotheses, one immediately deduces:

13 Note that ptrue = 1 — 2e and d = np 0 b s for the notation £ and d defined in [5j, respectively. 
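1. When $M = o\left(\frac{p_{\mathrm{obs}} n}{\log n}\right)$, one has

$$\inf_{\psi} P_{\mathrm{e}}(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } p_{\mathrm{true}} > 2(1+o(1))\sqrt{\frac{\log n + 2\log M}{p_{\mathrm{obs}} n M}}, \tag{68}$$

$$\inf_{\psi} P_{\mathrm{e}}(\psi) \not\to 0 \quad \text{if } p_{\mathrm{true}} < (1-o(1))\sqrt{\frac{\log n + \log M}{p_{\mathrm{obs}} n M}}. \tag{69}$$

2. When $M = \omega\left(\frac{p_{\mathrm{obs}} n}{\log n}\right)$, one has

$$\inf_{\psi} P_{\mathrm{e}}(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } p_{\mathrm{true}} > 4(1+o(1))\frac{\log n + 2\log M}{p_{\mathrm{obs}} n}, \tag{70}$$

$$\inf_{\psi} P_{\mathrm{e}}(\psi) \not\to 0 \quad \text{if } p_{\mathrm{true}} < \frac{\log n}{p_{\mathrm{obs}} n}. \tag{71}$$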


That being said, the recovery boundary presented in terms of $p_{\mathrm{true}}$ exhibits contrasting features in two separate regimes, as illustrated in Fig. 4(b). Some interpretations are in order.

1. Information-limited regime ($M = o(p_{\mathrm{obs}} n / \log n)$). The amount of information that can be conveyed through each pairwise measurement is captured by the divergence measure. In this small-alphabet regime, a little algebra gives $\mathsf{KL}^{\min} \approx p_{\mathrm{true}}^2 M$ (see Lemma 2), which is increasing in $M$. As a result, the alphabet size limits the amount of information that we can harvest, and the fundamental recovery boundary improves with $M$. For Erdős–Rényi graphs, the recovery conditions are tight up to a factor of 2 in the presence of a constant alphabet size, and up to a factor of $2\sqrt{3/2}$ for all $M \ll p_{\mathrm{obs}} n / \log n$.

2. Connectivity-limited regime ($M = \omega(p_{\mathrm{obs}} n / \log n)$). When $M$ further increases and enters this regime, the information carried by each measurement saturates and no longer scales as $p_{\mathrm{true}}^2 M$. In this regime, the measurement graph $\mathcal{G}$ presents a fundamental connectivity bottleneck. In fact, if $p_{\mathrm{true}} = o\left(\frac{\log n}{p_{\mathrm{obs}} n}\right)$, then there will be at least one vertex that is not connected with a single useful measurement, and hence there will be absolutely no basis to infer the value of this isolated vertex. Our bounds in this regime are order-wise optimal as long as the alphabet size is not super-polynomial in $n$.
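To get a feel for the two regimes, the thresholds can be evaluated numerically. The sketch below assumes the reading of (68)-(71) in which the small-alphabet boundaries scale as $\sqrt{(\log n + c\log M)/(p_{\mathrm{obs}} n M)}$ and the large-alphabet boundaries as $(\log n + c\log M)/(p_{\mathrm{obs}} n)$, drops all $o(1)$ terms, and uses the hypothetical helper name `thresholds`; it is an illustration rather than a restatement of the corollary.

```python
import math

def thresholds(n: int, p_obs: float, M: int) -> dict:
    """Evaluate the (assumed) boundaries on p_true with o(1) terms dropped."""
    log_n, log_m = math.log(n), math.log(M)
    return {
        # Information-limited regime, as read from (68) and (69).
        "small_sufficient": 2.0 * math.sqrt((log_n + 2 * log_m) / (p_obs * n * M)),
        "small_necessary": math.sqrt((log_n + log_m) / (p_obs * n * M)),
        # Connectivity-limited regime, as read from (70) and (71).
        "large_sufficient": 4.0 * (log_n + 2 * log_m) / (p_obs * n),
        "large_necessary": log_n / (p_obs * n),
        # Alphabet size around which the two regimes meet (order-wise).
        "crossover_M": p_obs * n / log_n,
    }

if __name__ == "__main__":
    for M in (2, 10, 100, 10_000):
        t = thresholds(n=10**5, p_obs=0.05, M=M)
        print(M, {key: round(val, 5) for key, val in t.items()})
```

With $n = 10^5$ and $p_{\mathrm{obs}} = 0.05$ (the setting of Fig. 4(b)), the small-alphabet thresholds decrease with $M$ while the large-alphabet thresholds grow only logarithmically in $M$.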


5.3 Haplotype Assembly 

The pairwise measurement model can also be applied to analyze the haplotype assembly problem discussed in Section 1. As formulated in [39, 55], consider $n$ SNPs on a chromosome, represented by a sequence $\{x_1, \cdots, x_n\} \in \{0,1\}^n$ such that a major (resp. minor) allele is denoted by 0 (resp. 1). Employing certain sequencing technologies, one obtains a collection of independent paired reads such that for any $(i,j) \in \mathcal{E}$,

$$y_{ij}^{(k)} = \begin{cases} x_i \oplus x_j, & \text{w.p. } 1-\theta, \\ x_i \oplus x_j \oplus 1, & \text{w.p. } \theta. \end{cases} \tag{72}$$

Here, $y_{ij}^{(k)}$ stands for the $k$-th noisy read of the parity between the $i$-th and the $j$-th SNPs, and $0 < \theta < 1/2$ denotes the read error rate. We assume that the reads taken on each edge are independent.
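The read model can be simulated in a few lines. The following is a minimal sketch under stated assumptions: each read is the parity of two SNPs flipped independently with probability `theta`, edges connect SNPs at distance at most `w`, and every edge carries the same number `L` of reads. The function names and the per-edge majority vote are illustrative and not taken from the paper.

```python
import random

def ring_edges(n, w):
    """Pairs (i, j) with i < j and j - i <= w."""
    return [(i, j) for i in range(n) for j in range(i + 1, min(i + w, n - 1) + 1)]

def simulate_reads(x, edges, theta, L, seed=0):
    """Draw L noisy parity reads per edge; each read is flipped w.p. theta."""
    rng = random.Random(seed)
    reads = {}
    for i, j in edges:
        parity = x[i] ^ x[j]
        reads[(i, j)] = [parity ^ int(rng.random() < theta) for _ in range(L)]
    return reads

def majority_parity(reads):
    """Per-edge majority vote, a naive estimate of each parity."""
    return {edge: int(2 * sum(r) > len(r)) for edge, r in reads.items()}

if __name__ == "__main__":
    x = [0, 1, 1, 0, 1, 0]
    reads = simulate_reads(x, ring_edges(len(x), w=2), theta=0.1, L=15)
    print(majority_parity(reads))
```

The majority vote only denoises individual edges; it is not the joint decoder analyzed in the paper.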

A realistic measurement graph that respects current sequencing technologies is the one in which measurements are obtained only when the $i$-th and the $j$-th SNPs are geometrically close, i.e., $|i-j| \le w$ for some constant$^{14}$ $w > 0$. This is captured by a generalized ring graph, denoted by $\mathcal{G}_{\mathrm{ring}} = (\mathcal{V}, \mathcal{E}_{\mathrm{ring}})$, such that

$$(i,j) \in \mathcal{E}_{\mathrm{ring}} \quad \text{iff} \quad |i-j| \le w. \tag{73}$$

14 As discussed in [39], the separation between two DNA reads (called the insert size) is typically bounded within a small range, with the median insert size not exceeding a few times the separation between adjacent SNPs.


The number $L_{ij}$ of reads taken between $i$ and $j$ is assumed to be dependent on their separation, i.e.$^{15}$

$$L_{ij} = L p_{|i-j|} \tag{74}$$

for some parameters $L$ and $\{p_l \mid 1 \le l \le w\}$.

Additionally, a random and geometry-free measurement model has been investigated in [55] as well. The fundamental limit under this model is orderwise equivalent to that under an Erdős–Rényi graph with $L_{ij} = L$ for all $(i,j) \in \mathcal{E}$. For the sake of completeness, we derive consequences for both models as follows.


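Corollary 2. Consider the model (72), and assume that $\theta$ and $p_l$ are bounded away from 0.

(1) Suppose that $\mathcal{G} \sim \mathcal{G}_{\mathrm{ring}}$. There exist some universal constants $c_1 > c_2 > 0$ such that

$$\inf_{\psi} P_{\mathrm{e}}(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } (1-2\theta)^2 > \frac{c_1 \log n}{Lw}, \tag{75}$$

$$\inf_{\psi} P_{\mathrm{e}}(\psi) \not\to 0 \quad \text{if } (1-2\theta)^2 < \frac{c_2 \log n}{Lw}. \tag{76}$$

(2) Suppose that $\mathcal{G} \sim \mathcal{G}_{n,p_{\mathrm{obs}}}$ and $p_{\mathrm{obs}} > \frac{c_3 \log n}{n}$ for some sufficiently large constant $c_3 > 0$. There exist some universal constants $c_4, c_5 > 0$ such that

$$\inf_{\psi} P_{\mathrm{e}}(\psi) \xrightarrow{n\to\infty} 0 \quad \text{if } (1-2\theta)^2 > \frac{c_4 \log n}{L n p_{\mathrm{obs}}}, \tag{77}$$

$$\inf_{\psi} P_{\mathrm{e}}(\psi) \not\to 0 \quad \text{if } (1-2\theta)^2 < \frac{c_5 \log n}{L n p_{\mathrm{obs}}}. \tag{78}$$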


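Proof. For the sufficient condition, we only need to calculate the Rényi divergence. For each $(i,j) \in \mathcal{E}$, letting $D^{ij}_{1/2}$ be the Rényi divergence of order 1/2 between the distributions of $\{y_{ij}^{(k)}\}_{1\le k\le L_{ij}}$ given two distinct inputs (i.e. 0 and 1), one obtains

$$D^{ij}_{1/2} \overset{(\mathrm{i})}{=} -2L_{ij}\log\left(1 - \tfrac{1}{2}\mathsf{Hel}_{1/2}(\theta \,\|\, 1-\theta)\right) \overset{(\mathrm{ii})}{\ge} L_{ij}\,\mathsf{Hel}_{1/2}(\theta \,\|\, 1-\theta), \tag{79}$$

where (i) follows from additivity of Rényi divergence [58, Theorem 2.8], and (ii) follows since $1 - x \le e^{-x}$. Furthermore,

$$\tfrac{1}{2}\mathsf{Hel}_{1/2}(\theta \,\|\, 1-\theta) = \left(\sqrt{\theta} - \sqrt{1-\theta}\right)^2 = \frac{(1-2\theta)^2}{\left(\sqrt{\theta} + \sqrt{1-\theta}\right)^2} \asymp (1-2\theta)^2. \tag{80}$$

Recall that $L_{ij} = L p_{|i-j|} \asymp L$ when $\mathcal{G} \sim \mathcal{G}_{\mathrm{ring}}$, whereas $L_{ij} = L$ when $\mathcal{G} \sim \mathcal{G}_{n,p_{\mathrm{obs}}}$. These taken together with Theorem 5 and Lemma 1 (resp. Theorem 2) establish the sufficient condition for $\mathcal{G}_{\mathrm{ring}}$ (resp. $\mathcal{G}_{n,p_{\mathrm{obs}}}$).

For the necessary condition, by replacing all $p_l$ with $\max_l p_l$ in (74), we obtain a new model such that any sufficient recovery condition for the original model holds for this new model as well. We then move on to compute the KL divergence for the new model:

$$\mathsf{KL}^{\min} \le L\,\mathsf{KL}(\theta \,\|\, 1-\theta) \overset{(\mathrm{a})}{\le} L\,\chi^2(\theta \,\|\, 1-\theta) = L\,\frac{(1-2\theta)^2}{\theta(1-\theta)}, \tag{81}$$

where (a) follows from [58, Equation (7)]. Substitution into Theorem 6 finishes the proof. □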
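The divergence calculations in this proof are easy to sanity-check numerically. The sketch below assumes the convention $\tfrac12\mathsf{Hel}_{1/2}(P\|Q) = 1 - \sum\sqrt{pq}$ for the Bernoulli pair $(\theta, 1-\theta)$, together with the standard closed forms of the KL and $\chi^2$ divergences; the function names are illustrative.

```python
import math

def half_hellinger(theta: float) -> float:
    """(1/2) * Hel_{1/2} between Bernoulli(theta) and Bernoulli(1 - theta)."""
    return 1.0 - 2.0 * math.sqrt(theta * (1.0 - theta))

def kl(theta: float) -> float:
    """KL(theta || 1 - theta) in nats."""
    return (1.0 - 2.0 * theta) * math.log((1.0 - theta) / theta)

def chi_square(theta: float) -> float:
    """chi^2(theta || 1 - theta) = (1 - 2 theta)^2 / (theta (1 - theta))."""
    return (1.0 - 2.0 * theta) ** 2 / (theta * (1.0 - theta))

if __name__ == "__main__":
    for theta in (0.05, 0.2, 0.4, 0.49):
        gap_sq = (1.0 - 2.0 * theta) ** 2
        print(f"theta={theta:.2f}  half-Hel/gap^2={half_hellinger(theta) / gap_sq:.3f}"
              f"  KL={kl(theta):.4f}  chi2={chi_square(theta):.4f}")
```

The printed ratio stays between 1/2 and 1, which is the order-wise statement behind (80), and the KL value never exceeds the $\chi^2$ value, as used in (81).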

We now compare our results with prior results. The fundamental limits given in [39, 55] were based on
coverage (or sample complexity) as a metric, that is, the total number of reads required for perfect haplotype 


15 Careful readers will note that this assumption is different from the model adopted in [39, 55], where the total number of reads is fixed with the reads independently generated. Nevertheless, the model considered here (which significantly simplifies presentation) is sufficient to capture the right scaling of the performance limits, since these two models are orderwise equivalent due to measure concentration.
assembly. Recognizing that $nLw$ (resp. $\binom{n}{2} p_{\mathrm{obs}} L$) captures the order of the total number of paired reads for $\mathcal{G}_{\mathrm{ring}}$ (resp. $\mathcal{G}_{n,p_{\mathrm{obs}}}$), we see that the minimal sample complexity obeys

$$nLw \asymp \frac{n\log n}{(1-2\theta)^2} \quad \text{when } \mathcal{G} \sim \mathcal{G}_{\mathrm{ring}}; \tag{82}$$

$$n^2 L p_{\mathrm{obs}} \asymp \frac{n\log n}{(1-2\theta)^2} \quad \text{when } \mathcal{G} \sim \mathcal{G}_{n,p_{\mathrm{obs}}}. \tag{83}$$

Consequently, for the generalized ring graph, our results match the sample complexity limits characterized in [39] in an orderwise sense, which is proportional to

$$\frac{n\log n}{1 - e^{-\mathsf{KL}(0.5\,\|\,\theta)}} \approx \frac{n\log n}{\mathsf{KL}(0.5\,\|\,\theta)} = (2+o(1))\frac{n\log n}{(1-2\theta)^2}.$$

On the other hand, for the Erdős–Rényi graphs, the minimum sample complexity scales as $\frac{n\log n}{(1-2\theta)^2}$, which coincides with the orderwise limits $O(n\log n)$ derived in [55].

Notably, our results are not restricted to the classical large-sample asymptotics where $\theta$ is fixed while $n$ grows to infinity. This strengthens [39, 55] by accommodating the regime where $\theta - 1/2 = o(1)$, which characterizes the non-asymptotic tradeoff between $n$ and the read quality. As a final remark, while our results are tight in capturing the right scaling w.r.t. the read error rate as well as the number of SNPs, our derivation is not tight in characterizing the behavior w.r.t. $p_l$ (or the notation $W$ given in [39]).


6 Concluding Remarks 

This paper investigates simultaneous recovery of multiple node variables based on noisy graph-based measurements, under the pairwise difference model. The problem formulation spans numerous applications
including image registration, graph matching, community detection, and computational biology. We develop 
a unified framework in understanding all problems of this kind based on representing the available pairwise 
measurements as a graph, and then representing the noise on the measurements using a general channel with 
a given input/output transition measure. This framework accommodates large alphabets, general channel 
transition probabilities, and general graph structures in a non-asymptotic manner. Our results underscore 
the interplay between the minimum channel divergence measures and the minimum cut size of the measurement graph. Moreover, for various homogeneous graphs, the recovery criterion relies almost only on
the first-order graphical metrics independent of other second-order metrics like the spectral gap. We expect 
that such fundamental recovery criterion will provide a general benchmark for evaluating the performance 
of practical algorithms over many applications. 

For concreteness, we restrict our attention to the pairwise difference model in this paper, but we remark 
that the analysis framework is somewhat generic and applies to a broader family of pairwise measurements. 
For instance, consider a more general invertible pairwise relation, denoted by $x_i \ominus x_j$, that satisfies

$$\begin{cases} x_1 \ominus x_2 \ne x_1 \ominus x_3, & \forall x_2 \ne x_3; \\ x_1 \ominus x_2 \ne x_4 \ominus x_2, & \forall x_1 \ne x_4. \end{cases}$$

As an example, the addition operator defined as $x_i \ominus x_j := a x_i + b x_j \pmod{M}$ falls within this class as long as both $(a, M)$ and $(b, M)$ are coprime. Interestingly, most of the analyses carry over to such models and reappear suitably generalized. Details concerning full generalization of our results are left for future work.
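The coprimality condition in this example can be checked by brute force for small alphabets. The snippet below is a minimal sketch: it tests the two invertibility requirements directly for the operator $(x, y) \mapsto ax + by \pmod{M}$ and compares the outcome with the gcd criterion. The function name `is_invertible_relation` is hypothetical.

```python
from math import gcd

def is_invertible_relation(a: int, b: int, M: int) -> bool:
    """Check that (x, y) -> (a*x + b*y) mod M is injective in each argument."""
    def op(x, y):
        return (a * x + b * y) % M
    # Fixing the first argument, distinct second arguments give distinct outputs.
    injective_in_second = all(len({op(x, y) for y in range(M)}) == M for x in range(M))
    # Fixing the second argument, distinct first arguments give distinct outputs.
    injective_in_first = all(len({op(x, y) for x in range(M)}) == M for y in range(M))
    return injective_in_second and injective_in_first

if __name__ == "__main__":
    M = 12
    for a, b in [(1, 11), (5, 7), (2, 5), (3, 4)]:
        coprime = gcd(a, M) == 1 and gcd(b, M) == 1
        print(a, b, is_invertible_relation(a, b, M), coprime)
```

The pairwise difference model corresponds to $a = 1$ and $b = M - 1$, which always passes the check.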

While our paper centers on the minimax recovery involving all possible input configurations, there exists 
another family of applications where the inputs fall within a more restricted class (e.g. the class of inputs 
whose components are spread out over the entire alphabet). In addition, it would be interesting to establish 
how the fundamental limits can be improved under the partial recovery setting, namely, the situation where 
one only demands reconstruction of a (large) fraction of input variables. Even in the exact recovery situation, 
it remains to be seen whether the universal pre-constants can be further tightened. Moving away from the 
statistical guarantees, another important issue is the computational feasibility of information recovery. It has 
recently been demonstrated by [lT] that the information and computation limits meet for many homogeneous
graphs (e.g. rings, lines, small-world graphs, grids). It would be of great interest to see whether there exists 
any computational gap away from the statistical limits for a more general family of graphs. 
A Proof of Theorem [l]

Suppose that both the ground truth and the null hypothesis are $x = x^*$. Consider the class of alternative hypotheses parametrized by $k$ ($1 \le k < n$) as follows

$$\mathcal{H}_k := \{x \mid \|x - x^*\|_0 = n-k\}, \tag{85}$$

which comprises at most $\binom{n}{k}(M-1)^{n-k}$ distinct hypotheses. For notational convenience, denote by $\mathbb{P}_w(\cdot)$ (resp. $\mathbb{P}_0(\cdot)$) the probability measure of $y$ conditional on the alternative hypothesis $x = w$ (resp. the null hypothesis $x = x^*$). We let $P_{\mathrm{e},\mathcal{H}_k}$ represent the probability of error when restricted to the class $\mathcal{H}_k$ of alternative hypotheses. For simplicity of presentation, we will assume $x^* = 0$ in what follows, but all steps apply to other choices of $x^*$.

For any $w \in \mathcal{H}_k$, denote by $\mathcal{S}_i$ ($0 \le i < M$) the set of vertices $v$ obeying $w_v = i$, and let $n_i = |\mathcal{S}_i|$. Apparently, there are $\frac{1}{2}\sum_{i=0}^{M-1} e(\mathcal{S}_i, \mathcal{S}_i^c)$ distinct locations $(l,j)$ satisfying $l > j$ and $w_l - w_j \ne 0$, where

e(S,S c ) denotes the number of cut edges as defined in Section [F3| With this in mind, it follows from the 
Chernoff bound that 


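$$\mathbb{P}_0\left\{\log\frac{d\mathbb{P}_w(y)}{d\mathbb{P}_0(y)} > 0\right\} = \mathbb{P}_0\left\{\sum_{(i,j)\in\mathcal{E},\, i>j} \alpha\log\frac{d\mathbb{P}_w(y_{ij})}{d\mathbb{P}_0(y_{ij})} > 0\right\} \le \prod_{(i,j)\in\mathcal{E},\, i>j} \mathbb{E}_0\left[\left(\frac{d\mathbb{P}_w(y_{ij})}{d\mathbb{P}_0(y_{ij})}\right)^{\alpha}\right] \tag{86}$$

$$= \prod_{(i,j)\in\mathcal{E},\, i>j} \Big(1 - (1-\alpha)\,\mathsf{Hel}_\alpha\big(\mathbb{P}_w(y_{ij}) \,\|\, \mathbb{P}_0(y_{ij})\big)\Big) \tag{87}$$

$$= \exp\Big(-(1-\alpha)\sum_{(i,j)\in\mathcal{E},\, i>j} D_\alpha\big(\mathbb{P}_w(y_{ij}) \,\|\, \mathbb{P}_0(y_{ij})\big)\Big) \tag{88}$$

$$\le \exp\Big\{-(1-\alpha)\,\frac{\sum_{i=0}^{M-1} e(\mathcal{S}_i, \mathcal{S}_i^c)}{2}\, D_\alpha^{\min}\Big\}, \tag{89}$$

where (87) follows from the definition of the Hellinger divergence, (88) comes from the definition (9), and (89) arises since $D_\alpha(\mathbb{P}_w(y_{ij}) \,\|\, \mathbb{P}_0(y_{ij})) \ne 0$ if and only if $w_i - w_j \ne 0$.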

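Additionally, define the quantity

$$N_w := \Big|\cup_i \{(l,j) \in (\mathcal{S}_i, \mathcal{S}_i^c)\}\Big| = \frac{1}{2}\sum_{i=0}^{M-1} |\mathcal{S}_i|\,(n - |\mathcal{S}_i|);$$

then it follows from the definition of $\mathcal{G}_{n,p_{\mathrm{obs}}}$ that

$$\frac{1}{2}\sum_{i=0}^{M-1} e(\mathcal{S}_i, \mathcal{S}_i^c) \sim \mathrm{Binomial}(N_w, p_{\mathrm{obs}}).$$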


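Unconditioning on $\mathcal{E}$ in the inequality (89) gives

$$\begin{aligned}
\mathbb{P}_0\left\{\log\frac{d\mathbb{P}_w(y)}{d\mathbb{P}_0(y)} > 0\right\}
&\le \mathbb{E}\left[\exp\Big(-(1-\alpha)\,\frac{\sum_{i=0}^{M-1} e(\mathcal{S}_i,\mathcal{S}_i^c)}{2}\,D_\alpha^{\min}\Big)\right] \\
&= \sum_{l=0}^{N_w}\binom{N_w}{l} p_{\mathrm{obs}}^{l}(1-p_{\mathrm{obs}})^{N_w-l}\exp\{-l\cdot(1-\alpha)D_\alpha^{\min}\} \\
&= (1-p_{\mathrm{obs}})^{N_w}\sum_{l=0}^{N_w}\binom{N_w}{l}\Big(\frac{p_{\mathrm{obs}}}{1-p_{\mathrm{obs}}}\Big)^{l}\exp\{-l\cdot(1-\alpha)D_\alpha^{\min}\} \\
&\overset{(\mathrm{a})}{=} (1-p_{\mathrm{obs}})^{N_w}\Big(1 + \frac{p_{\mathrm{obs}}}{1-p_{\mathrm{obs}}}\exp\{-(1-\alpha)D_\alpha^{\min}\}\Big)^{N_w} \\
&= \big(1 - p_{\mathrm{obs}} + p_{\mathrm{obs}}\exp\{-(1-\alpha)D_\alpha^{\min}\}\big)^{N_w} \\
&\overset{(\mathrm{b})}{\le} \exp\big\{-N_w p_{\mathrm{obs}}\big(1 - \exp(-(1-\alpha)D_\alpha^{\min})\big)\big\} \\
&\overset{(\mathrm{c})}{=} \exp\big\{-N_w p_{\mathrm{obs}}(1-\alpha)\,\mathsf{Hel}_\alpha^{\min}\big\} \\
&= \exp\Big\{-p_{\mathrm{obs}}(1-\alpha)\,\mathsf{Hel}_\alpha^{\min}\cdot\frac{1}{2}\Big(n^2 - \sum_{i=0}^{M-1} n_i^2\Big)\Big\},
\end{aligned} \tag{90}$$

where (a) follows from the binomial theorem, (b) relies on the elementary inequality $1 - x \le e^{-x}$, (c) comes from the definition (18), and the last line follows since

$$N_w = \frac{1}{2}\sum_{i=0}^{M-1} |\mathcal{S}_i|\,(n - |\mathcal{S}_i|) = \frac{1}{2}\sum_{i=0}^{M-1} n_i(n - n_i) = \frac{1}{2}\Big(n^2 - \sum_{i=0}^{M-1} n_i^2\Big).$$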
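The averaging step over the random edge set can be verified numerically. The sketch below is an illustration under the assumption that the number of observed cut locations is Binomial$(N, p)$ and that each contributes a factor $e^{-c}$ with $c = (1-\alpha)D_\alpha^{\min}$; it compares the explicit binomial sum, the closed form obtained from the binomial theorem, and the exponential upper bound. All names are illustrative.

```python
import math

def averaged_chernoff(N: int, p: float, c: float) -> float:
    """E[exp(-c * X)] for X ~ Binomial(N, p), computed term by term."""
    return sum(math.comb(N, l) * p**l * (1 - p) ** (N - l) * math.exp(-l * c)
               for l in range(N + 1))

def closed_form(N: int, p: float, c: float) -> float:
    """Binomial-theorem form (1 - p + p * exp(-c))^N."""
    return (1 - p + p * math.exp(-c)) ** N

def exponential_bound(N: int, p: float, c: float) -> float:
    """Upper bound exp(-N * p * (1 - exp(-c))), using 1 - x <= exp(-x)."""
    return math.exp(-N * p * (1 - math.exp(-c)))

if __name__ == "__main__":
    N, p, c = 40, 0.15, 0.8
    print(averaged_chernoff(N, p, c), closed_form(N, p, c), exponential_bound(N, p, c))
```

The first two numbers agree up to floating-point error, and the third is never smaller.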



It remains to control $\sum_{i=0}^{M-1} n_i^2$. Recognize that the input is unique only up to global offset, that is, for any $l$, the inputs $w$ and $w - l\cdot\mathbf{1}$ result in the same pairwise inputs $[w_i - w_j]_{1\le i,j\le n}$. Therefore, we assume without loss of generality that$^{16}$

$$k = n_0 \ge \max\{n_1, n_2, \cdots, n_{M-1}\}. \tag{91}$$

Letting $\rho := \lfloor n/k \rfloor$, we claim that $\sum_{i=0}^{M-1} n_i^2$ under the constraint $n_i \le k$ is maximized by the configuration

$$\begin{cases} n_0 = n_1 = \cdots = n_{\rho-1} = k, \\ n_\rho = n - k\rho, \\ n_{\rho+1} = \cdots = n_{M-1} = 0, \end{cases}$$

which we will prove by contradiction. Without loss of generality, suppose that the maximizing solution is $n_0 \ge n_1 \ge \cdots \ge n_{M-1}$, and denote by $\tilde\rho$ the smallest index such that $n_{\tilde\rho} \le k - 1$. If $\tilde\rho \le \rho - 1$, then by replacing $(n_{\tilde\rho}, n_{\tilde\rho+1})$ with $(n_{\tilde\rho} + 1, n_{\tilde\rho+1} - 1)$, we obtain a strictly better feasible solution since

$$(n_{\tilde\rho} + 1)^2 + (n_{\tilde\rho+1} - 1)^2 = n_{\tilde\rho}^2 + n_{\tilde\rho+1}^2 + 2(n_{\tilde\rho} - n_{\tilde\rho+1}) + 2 > n_{\tilde\rho}^2 + n_{\tilde\rho+1}^2.$$

16 Otherwise, if $n_i = \max\{n_1, n_2, \cdots, n_{M-1}\}$ instead, we can always enforce a global shift $i$ on $w$ to yield $w - i\cdot\mathbf{1}$ in order to satisfy this condition without affecting the output distribution.

This results in contradiction, and hence $\tilde\rho = \rho$. Similarly, we cannot have $n_\rho < n - k\rho$, since replacing $(n_\rho, n_{\rho+1})$ with $(n_\rho + 1, n_{\rho+1} - 1)$ leads to a strictly better solution. Consequently, for all $\{n_i : 0 \le i < M\}$ satisfying (91), one has

$$\sum_{i=0}^{M-1} n_i^2 \le \Big\lfloor\frac{n}{k}\Big\rfloor k^2 + \Big(n - k\Big\lfloor\frac{n}{k}\Big\rfloor\Big)^2, \tag{92}$$

leaving us two cases below to deal with.
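The maximization claim can be confirmed by exhaustive search on small instances. The following sketch enumerates all configurations $(n_0, \dots, n_{M-1})$ with $0 \le n_i \le k$ and $\sum_i n_i = n$, and compares the largest value of $\sum_i n_i^2$ with the closed form $\lfloor n/k\rfloor k^2 + (n - k\lfloor n/k\rfloor)^2$. The function names are hypothetical, and the search is only feasible for small $M$ and $k$.

```python
from itertools import product

def brute_force_max(n: int, k: int, M: int) -> int:
    """Largest sum of squares over (n_0..n_{M-1}) with 0 <= n_i <= k, sum = n."""
    return max(sum(v * v for v in cfg)
               for cfg in product(range(k + 1), repeat=M)
               if sum(cfg) == n)

def claimed_max(n: int, k: int) -> int:
    """Closed form: floor(n/k) blocks of size k plus one remainder block."""
    rho = n // k
    return rho * k * k + (n - k * rho) ** 2

if __name__ == "__main__":
    for n, k, M in [(7, 3, 4), (8, 4, 3), (9, 2, 5), (10, 6, 3)]:
        print(n, k, M, brute_force_max(n, k, M), claimed_max(n, k))
```

The instances require $Mk \ge n$ so that at least one feasible configuration exists.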

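Case 1. Suppose that $k \le n/2$. The inequality $n - k\lfloor n/k \rfloor < k$ leads to

$$\sum_{i=0}^{M-1} n_i^2 \le \Big\lfloor\frac{n}{k}\Big\rfloor k^2 + \Big(n - k\Big\lfloor\frac{n}{k}\Big\rfloor\Big)k = nk.$$

This combined with (90) yields

$$\mathbb{P}_0\left\{\log\frac{d\mathbb{P}_w(y)}{d\mathbb{P}_0(y)} > 0\right\} \le \exp\Big\{-\frac{p_{\mathrm{obs}}(n^2 - nk)}{2}(1-\alpha)\,\mathsf{Hel}_\alpha^{\min}\Big\}.$$

Employing the union bound over $\mathcal{H}_k$ we obtain

$$P_{\mathrm{e},\mathcal{H}_k} \le \binom{n}{k}(M-1)^{n-k}\exp\Big\{-\frac{p_{\mathrm{obs}}(n^2 - nk)}{2}(1-\alpha)\,\mathsf{Hel}_\alpha^{\min}\Big\} = \exp\Big\{\log\binom{n}{k} + (n-k)\log(M-1) - \frac{p_{\mathrm{obs}}(n^2 - nk)}{2}(1-\alpha)\,\mathsf{Hel}_\alpha^{\min}\Big\}.$$

Under the assumption (23), one has

$$(1-\alpha)\,\mathsf{Hel}_\alpha^{\min}\cdot p_{\mathrm{obs}} n > \log n + 2\log(M-1)$$

for some $0 < \alpha < 1$, which further gives

$$\frac{p_{\mathrm{obs}}(n^2 - nk)}{2}(1-\alpha)\,\mathsf{Hel}_\alpha^{\min} > \frac{(n-k)\log n}{2} + (n-k)\log(M-1).$$

Putting the above computation together yields

$$P_{\mathrm{e},\mathcal{H}_k} \le \exp\Big(\log\binom{n}{k} - \frac{(n-k)\log n}{2}\Big) \overset{(\mathrm{i})}{\le} 2^n\cdot n^{-\frac{1}{2}(n-k)} \overset{(\mathrm{ii})}{\le} 2^n\cdot n^{-n/4} \le C_1 n^{-c_1 n} \tag{93}$$

for some universal constants $C_1, c_1 > 0$, where (i) uses the fact that $\binom{n}{k} \le 2^n$, (ii) holds since $k \le n/2$, and the last inequality follows since $2^n \ll n^{\epsilon n}$ for any constant $\epsilon > 0$. This approaches zero (super)-exponentially fast.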

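Case 2. We now move on to the case where $k > n/2$. In this regime one has $\lfloor n/k \rfloor = 1$, and thus (92) gives

$$\sum_{i=0}^{M-1} n_i^2 \le k^2 + (n-k)^2 = n^2 - 2k(n-k).$$

This taken collectively with (90) implies that

$$\mathbb{P}_0\left\{\log\frac{d\mathbb{P}_w(y)}{d\mathbb{P}_0(y)} > 0\right\} \le \exp\big(-p_{\mathrm{obs}}(1-\alpha)\,k(n-k)\,\mathsf{Hel}_\alpha^{\min}\big). \tag{94}$$

Apply the union bound over $\mathcal{H}_k$ to deduce that

$$P_{\mathrm{e},\mathcal{H}_k} \le \binom{n}{k}(M-1)^{n-k}\exp\big(-p_{\mathrm{obs}}(1-\alpha)\,k(n-k)\,\mathsf{Hel}_\alpha^{\min}\big) = \exp\Big(\log\binom{n}{k} + (n-k)\log(M-1) - p_{\mathrm{obs}}(1-\alpha)\,k(n-k)\,\mathsf{Hel}_\alpha^{\min}\Big). \tag{95}$$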
For any constant $\delta > 0$, if the minimum Hellinger divergence obeys

$$(1-\alpha)\,\mathsf{Hel}_\alpha^{\min}\cdot p_{\mathrm{obs}} n > (1+\delta)\log(2n) + 2\log(M-1)$$

for some $0 < \alpha < 1$, then in the regime where $k > n/2$ one has

$$p_{\mathrm{obs}}\,k(n-k)(1-\alpha)\,\mathsf{Hel}_\alpha^{\min} > \frac{(1+\delta)\,k(n-k)\log(2n)}{n} + (n-k)\log(M-1).$$

Substitution into (95) gives

$$P_{\mathrm{e},\mathcal{H}_k} \le \exp\Big(\log\binom{n}{k} - \frac{(1+\delta)\,k(n-k)}{n}\log(2n)\Big),$$

which can be further divided into two cases.

(i) If $k/n \ge 1 - \delta/4$ and $k/n > 1/2$, then, using $\binom{n}{k} = \binom{n}{n-k} \le n^{n-k}$, the error probability is bounded by

$$\begin{aligned}
P_{\mathrm{e},\mathcal{H}_k} &\le \exp\Big((n-k)\log(2n) - \frac{(1+\delta)k}{n}(n-k)\log(2n)\Big) \\
&= \exp\big((n-k)\log(2n)\cdot(1 - (1+\delta)k/n)\big) \\
&\le \exp\Big((n-k)\log(2n)\cdot\Big(1 - (1+\delta)\max\Big\{1 - \frac{\delta}{4}, \frac{1}{2}\Big\}\Big)\Big) \\
&\le \exp\big(-\tilde\delta\,(n-k)\log(2n)\big),
\end{aligned}$$

where $\tilde\delta := \max\big\{\tfrac{3}{4}\delta - \tfrac{1}{4}\delta^2,\ \tfrac{\delta-1}{2}\big\}$.

(ii) If $k/n = 1 - \tau$ for some $\delta/4 < \tau < 1/2$, then

$$P_{\mathrm{e},\mathcal{H}_k} \le \exp\big(nH(\tau) - (1+\delta)\,n(1-\tau)\tau\log(2n)\big) \tag{96}$$

$$\le \exp\big(-n\big((1-\tau)\tau\log(2n) - H(\tau)\big)\big) \le C_2\exp(-c_2\,\delta n\log n) \tag{97}$$

for some universal constants $C_2, c_2 > 0$, where (96) makes use of the fact [20, Example 11.1.3] that $\frac{1}{n}\log\binom{n}{k} \le H(k/n) = H(\tau)$, with $H(\tau)$ denoting the binary entropy function.

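Putting the above inequalities together and applying the union bound reveal that

$$\begin{aligned}
P_{\mathrm{e}} &\le \sum_{k=\lceil n/M\rceil}^{n/2} P_{\mathrm{e},\mathcal{H}_k} + \sum_{k=n/2+1}^{(1-\frac{\delta}{4})n} P_{\mathrm{e},\mathcal{H}_k} + \sum_{k=(1-\frac{\delta}{4})n}^{n-1} P_{\mathrm{e},\mathcal{H}_k} \\
&\le \frac{n}{2}\cdot C_1 n^{-c_1 n} + \frac{n}{2}\cdot C_2\exp(-c_2\,\delta n\log n) + \sum_{k=(1-\frac{\delta}{4})n}^{n-1}\exp\big(-\tilde\delta\,(n-k)\log(2n)\big) \\
&\le \frac{n}{2}\cdot C_1 n^{-c_1 n} + \frac{n}{2}\cdot C_2\exp(-c_2\,\delta n\log n) + \frac{(2n)^{-\tilde\delta}}{1 - (2n)^{-\tilde\delta}} \\
&\le C_0 e^{-c_0\delta n\log n} + \frac{1}{(2n)^{\tilde\delta} - 1},
\end{aligned}$$

with $C_0, c_0 > 0$ denoting some universal constants.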
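The last term in this bound comes from a geometric series. As a small numerical check, assuming the summand has the form $(2n)^{-\tilde\delta j}$ with $j = n - k \ge 1$, the sketch below compares a truncated sum with the closed form $1/((2n)^{\tilde\delta} - 1)$; the function names are illustrative.

```python
def geometric_tail(n: int, delta: float, terms: int) -> float:
    """Partial sum of (2n)^(-delta * j) over j = 1..terms."""
    return sum((2 * n) ** (-delta * j) for j in range(1, terms + 1))

def closed_form_tail(n: int, delta: float) -> float:
    """Value of the infinite series: 1 / ((2n)^delta - 1)."""
    return 1.0 / ((2 * n) ** delta - 1.0)

if __name__ == "__main__":
    for n, delta in [(100, 0.5), (10_000, 0.25), (10**6, 1.0)]:
        print(n, delta, geometric_tail(n, delta, 200), closed_form_tail(n, delta))
```

Every partial sum stays below the closed form, which decays polynomially in $n$ for any fixed positive exponent.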
B Proof of Theorems 3(a) and 6

This section is mainly devoted to proving Theorem 6, which subsumes Theorem 3(a) as a special case. Without loss of generality, assume that the minimum KL divergence can be approached by the following pairs of indices

$$\mathsf{KL}(\mathbb{P}_1 \,\|\, \mathbb{P}_0) = \mathsf{KL}^{\min},$$

$$\mathsf{KL}(\mathbb{P}_l \,\|\, \mathbb{P}_0) \le (1+\zeta)\,\mathsf{KL}^{\min}, \quad 2 \le l \le m^{\mathrm{kl}}(\zeta),$$

and suppose that both the ground truth and the null hypothesis are $x = x^* = 0$. We would like to ensure that the observation $y$ conditional on $x = 0$ is distinguishable from the observation $y$ under any alternative hypothesis $x \ne 0$.

(1) To begin with, recall the definition

$$\mathcal{N}(k\cdot\mathrm{mincut}) := \{\mathcal{S} \subset \mathcal{V} : e(\mathcal{S}, \mathcal{S}^c) \le k\cdot\mathrm{mincut}\}.$$

For each vertex set $\mathcal{S} \in \mathcal{N}(k\cdot\mathrm{mincut})$, we generate one representative hypothesis $w$ such that

$$w_i = \begin{cases} 1, & \text{if } i \in \mathcal{S}, \\ 0, & \text{otherwise.} \end{cases}$$

This produces a collection of $|\mathcal{N}(k\cdot\mathrm{mincut})|$ distinct alternative hypotheses, denoted by $\mathcal{B}_k$. For each $w \in \mathcal{B}_k$, the distributions $\mathbb{P}_w$ and $\mathbb{P}_0$ disagree only over those locations residing in the associated cut set, which amounts to at most $k\cdot\mathrm{mincut}$ components. It then follows from the independence assumption of $y_{ij}$ that

$$\mathsf{KL}(\mathbb{P}_w \,\|\, \mathbb{P}_0) = e(\mathcal{S}, \mathcal{S}^c)\,\mathsf{KL}^{\min} \le k\cdot\mathrm{mincut}\cdot\mathsf{KL}^{\min}. \tag{98}$$

Suppose that ko := argmax/^i r^ ut and fix 0 < e < Applying the Fano-type inequality 57 Equation 
(2.70)] suggests that if 


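$$\frac{1}{|\mathcal{B}_{k_0}|}\sum_{w\in\mathcal{B}_{k_0}} \mathsf{KL}(\mathbb{P}_w \,\|\, \mathbb{P}_0) \le (1-\epsilon)\log|\mathcal{N}(k_0\cdot\mathrm{mincut})| - H(\epsilon), \tag{99}$$

then one necessarily has $\inf_\psi P_{\mathrm{e}}(\psi) \ge \epsilon$. With (98) and the definition (39) in mind, we see that (99) would follow from

$$\mathsf{KL}^{\min}\cdot k_0\,\mathrm{mincut} \le (1-\epsilon)\,k_0\,\tau^{\mathrm{cut}} - H(\epsilon),$$

which can further be ensured if

$$\mathsf{KL}^{\min}\cdot\mathrm{mincut} \le (1-\epsilon)\,\tau^{\mathrm{cut}} - H(\epsilon).$$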


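(2) Next, suppose that the minimum cut is attained by $(\mathcal{S}_{\mathrm{mc}}, \mathcal{S}_{\mathrm{mc}}^c)$. Consider another class $\mathcal{C}$ of hypotheses consisting of $m^{\mathrm{kl}}$ hypotheses. The $l$-th candidate is given by

$$\forall\, 1 \le l \le m^{\mathrm{kl}}: \quad w_i^{(l)} = \begin{cases} l, & \text{if } i \in \mathcal{S}_{\mathrm{mc}}, \\ 0, & \text{otherwise,} \end{cases}$$

all of which obey

$$\mathsf{KL}\big(\mathbb{P}_{w^{(l)}} \,\|\, \mathbb{P}_0\big) \le (1+\zeta)\,\mathrm{mincut}\cdot\mathsf{KL}^{\min}. \tag{100}$$

Applying the Fano inequality once again, we get $\inf_\psi P_{\mathrm{e}}(\psi) \ge \epsilon$ as long as

$$\frac{1}{m^{\mathrm{kl}}}\sum_{l=1}^{m^{\mathrm{kl}}} \mathsf{KL}\big(\mathbb{P}_{w^{(l)}} \,\|\, \mathbb{P}_0\big) \le (1-\epsilon)\log m^{\mathrm{kl}} - H(\epsilon). \tag{101}$$

Observe from (100) that (101) can be ensured under the condition

$$\mathsf{KL}^{\min}\cdot(1+\zeta)\,\mathrm{mincut} \le (1-\epsilon)\log m^{\mathrm{kl}} - H(\epsilon).$$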
(3) Finally, consider the set of configurations with binary alphabet having support size 1, i.e. the following $M - 1$ classes of hypotheses

$$\mathcal{H}_l = \{x \mid \|x\|_0 = 1,\ x \in \{0, l\}^n\}, \quad 1 \le l < M, \tag{102}$$

where each class $\mathcal{H}_l$ is composed of $n$ distinct alternative hypotheses. This guarantees that for any $w \in \mathcal{H}_l$, the distribution of $\{y_{ij}\} \mid x = 0$ differs from that of $\{y_{ij}\} \mid x = w$ in at most $d_{\max}$ locations.

For any hypothesis class H and any 0 < e < the Fano-type inequality 57, Equation (2.70)] suggests 
that inf^ P e (' ip ) > e occurs as long as 


m E KL ( p * 


?o) < 


m 


\n\ ^- - \h\ 


{(l-e)\og\n\-H(e)}. 


(103) 


By picking $\mathcal{H}$ to be $\mathcal{H} = \cup_{l=1}^{m^{\mathrm{kl}}}\mathcal{H}_l$, which obeys $|\mathcal{H}| = m^{\mathrm{kl}} n$, we can see from the definition of $m^{\mathrm{kl}}$ that

$$\frac{1}{|\mathcal{H}|}\sum_{w\in\mathcal{H}} \mathsf{KL}(\mathbb{P}_w \,\|\, \mathbb{P}_0) \le (1+\zeta)\,d_{\max}\,\mathsf{KL}^{\min},$$

and hence (103) would hold under the condition

$$\mathsf{KL}^{\min}\cdot(1+\zeta)\,d_{\max} \le (1-\epsilon)\big(\log n + \log m^{\mathrm{kl}}\big) - H(\epsilon). \tag{104}$$

Putting the above results together establishes Theorem 6.

We now specialize to Theorem 3(a), which follows immediately from (104). Specifically, for an Erdős–Rényi graph $\mathcal{G} \sim \mathcal{G}_{n,p_{\mathrm{obs}}}$, the Chernoff-type inequality [45, Theorems 4.4-4.5] indicates that for any $\epsilon > 0$,

$$d_{\max} \le (1+\epsilon)\,n p_{\mathrm{obs}} \tag{105}$$

holds with probability exceeding $1 - n^{-10}$, provided that $p_{\mathrm{obs}} > \frac{c\log n}{n}$ for some sufficiently large constant $c > 0$. Substitution into (104) immediately leads to Theorem 3(a).


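C Proof of Theorems 3(b) and 7

We start with the proof of Theorem 7, which accounts for a much broader context than Theorem 3(b). In a similar spirit to the proof of Theorem 6, assume that the minimum Hellinger divergence is achieved by the following pair of indices

$$\mathsf{Hel}_\alpha(\mathbb{P}_1 \,\|\, \mathbb{P}_0) = \mathsf{Hel}_\alpha^{\min},$$

and let the ground truth and the null hypothesis be $x = x^* = 0$.

For any class $\mathcal{H}$ of alternative hypotheses, the minimax lower bound [33, Theorem II.1] suggests that every $f$-divergence $D_f(\cdot)$ obeys

$$\sum_{w\in\mathcal{H}} D_f(\mathbb{P}_w \,\|\, \mathbb{P}_0) \ge f\big(|\mathcal{H}|(1 - P_{\mathrm{e}})\big) + (|\mathcal{H}| - 1)\,f\Big(\frac{|\mathcal{H}|\,P_{\mathrm{e}}}{|\mathcal{H}| - 1}\Big),$$

where $\mathbb{P}_w$ is the probability measure of $[y_{ij}]_{(i,j)\in\mathcal{E}}$ conditional on $x = w$. When specialized to the Hellinger divergence of order $\alpha$ (which corresponds to $f(x) = 1 - x^\alpha$), the above inequality leads to

$$(1-\alpha)\sum_{w\in\mathcal{H}} \mathsf{Hel}_\alpha(\mathbb{P}_w \,\|\, \mathbb{P}_0) \ge |\mathcal{H}| - |\mathcal{H}|^\alpha(1 - P_{\mathrm{e}})^\alpha - |\mathcal{H}|^\alpha(|\mathcal{H}| - 1)^{1-\alpha}P_{\mathrm{e}}^\alpha \ge |\mathcal{H}| - |\mathcal{H}|^\alpha - |\mathcal{H}|\,P_{\mathrm{e}}^\alpha.$$

Put another way,

$$P_{\mathrm{e}}^\alpha \ge 1 - \frac{(1-\alpha)\sum_{w\in\mathcal{H}} \mathsf{Hel}_\alpha(\mathbb{P}_w \,\|\, \mathbb{P}_0)}{|\mathcal{H}|} - |\mathcal{H}|^{-(1-\alpha)}. \tag{106}$$

Notably, for any product measures $P^n = P \times P \times \cdots \times P$ and $Q^n = Q \times Q \times \cdots \times Q$, the Hellinger divergence satisfies the decoupling equality

$$1 - (1-\alpha)\,\mathsf{Hel}_\alpha(P^n \,\|\, Q^n) = \int (dP^n)^\alpha (dQ^n)^{1-\alpha} = \Big(\int (dP)^\alpha (dQ)^{1-\alpha}\Big)^n = \big(1 - (1-\alpha)\,\mathsf{Hel}_\alpha(P \,\|\, Q)\big)^n. \tag{107}$$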
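The decoupling equality is straightforward to verify for discrete distributions. The sketch below builds the $n$-fold product measures explicitly and checks that the affinity $\sum_y P(y)^\alpha Q(y)^{1-\alpha}$ tensorizes; the distributions and function names are illustrative assumptions.

```python
import itertools
import math

def affinity(P, Q, alpha):
    """Sum over atoms of P^alpha * Q^(1-alpha); equals 1 - (1-alpha) * Hel_alpha."""
    return sum(p**alpha * q ** (1 - alpha) for p, q in zip(P, Q))

def product_measure(P, n):
    """Probability vector of the n-fold product measure, in lexicographic order."""
    return [math.prod(atoms) for atoms in itertools.product(P, repeat=n)]

if __name__ == "__main__":
    P, Q, alpha, n = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2], 0.3, 4
    lhs = affinity(product_measure(P, n), product_measure(Q, n), alpha)
    rhs = affinity(P, Q, alpha) ** n
    print(lhs, rhs)
```

Both product vectors are enumerated in the same order, so the elementwise pairing inside `affinity` matches atoms correctly.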


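If all hypotheses $w \in \mathcal{H}$ satisfy $\|w - x^*\|_0 \le k$, then $\mathbb{P}_w$ and $\mathbb{P}_0$ are different over at most $k d_{\max}$ locations. Thus, if the divergence measure at each of these locations is identical and equal to some given value $h_\alpha$, then it follows from the independence assumption of $y_{ij}$ that

$$1 - (1-\alpha)\,\mathsf{Hel}_\alpha(\mathbb{P}_w \,\|\, \mathbb{P}_0) \ge \big(1 - (1-\alpha)h_\alpha\big)^{k d_{\max}}.$$

This together with (106) suggests that: as long as $(1-\alpha)h_\alpha \le \frac{1}{2}$, one necessarily has

$$P_{\mathrm{e}}^\alpha \ge \big(1 - (1-\alpha)h_\alpha\big)^{k d_{\max}} - |\mathcal{H}|^{-(1-\alpha)} \ge e^{-\left((1-\alpha)h_\alpha + (1-\alpha)^2 h_\alpha^2\right)k d_{\max}} - |\mathcal{H}|^{-(1-\alpha)},$$

which results from the inequality that $\log(1-x) \ge -x - x^2$ for any $0 \le x \le 1/2$.

As a consequence, if the following condition holds

$$e^{-\left((1-\alpha)h_\alpha + (1-\alpha)^2 h_\alpha^2\right)k d_{\max}} - |\mathcal{H}|^{-(1-\alpha)} \ge \xi^\alpha$$

or, equivalently,

$$(1-\alpha)h_\alpha\big[1 + (1-\alpha)h_\alpha\big] \le -\frac{\log\big(\xi^\alpha + |\mathcal{H}|^{-(1-\alpha)}\big)}{k d_{\max}}, \tag{108}$$

then the minimax probability of error must exceed $\inf_\psi P_{\mathrm{e}} \ge \xi$. Solving the quadratic inequality (108) and utilizing the fact $\sqrt{1+4x} - 1 \ge 2x - 4x^2$ ($x > 0$), we see that (108) would follow as long as

$$(1-\alpha)h_\alpha \le -\frac{\log\big(\xi^\alpha + |\mathcal{H}|^{-(1-\alpha)}\big)}{k d_{\max}} - \frac{2\log^2\big(\xi^\alpha + |\mathcal{H}|^{-(1-\alpha)}\big)}{(k d_{\max})^2}. \tag{109}$$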


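Finally, setting $\xi = n^{-\epsilon}$ and $\mathcal{H} = \mathcal{H}_1$ (cf. Definition (102)), one has $|\mathcal{H}| = n$, $k = 1$ and $h_\alpha = \mathsf{Hel}_\alpha^{\min}$. In the regime where

$$\epsilon \le \frac{1-\alpha}{\alpha} \iff \alpha \le \frac{1}{1+\epsilon},$$

we have

$$\xi^\alpha = n^{-\epsilon\alpha} \ge n^{-(1-\alpha)} = |\mathcal{H}|^{-(1-\alpha)}.$$

The condition (109) is then guaranteed to hold if

$$(1-\alpha)\,\mathsf{Hel}_\alpha^{\min} \le -\frac{\log(2\xi^\alpha)}{d_{\max}} - \frac{2\log^2(2\xi^\alpha)}{d_{\max}^2}, \tag{110}$$

which would follow if

$$(1-\alpha)\,\mathsf{Hel}_\alpha^{\min} \le \frac{\epsilon\alpha\log n - \log 2}{d_{\max}} - \frac{2\,[\epsilon\alpha\log n - \log 2]^2}{d_{\max}^2}, \tag{111}$$

where we have used $\xi^\alpha = n^{-\epsilon\alpha}$. Besides, the condition $(1-\alpha)h_\alpha \le \frac{1}{2}$ becomes $(1-\alpha)\,\mathsf{Hel}_\alpha^{\min} \le \frac{1}{2}$, which can be ensured under (111) together with the condition

$$\frac{\epsilon\alpha\log n}{d_{\max}} \le \frac{1}{2},$$

as claimed.

Finally, recall that when $\mathcal{G} \sim \mathcal{G}_{n,p_{\mathrm{obs}}}$, one has $d_{\max} \le (1+\epsilon)\,p_{\mathrm{obs}} n$ as long as $n p_{\mathrm{obs}}/\log n$ is sufficiently large. Plugging this into the preceding bound completes the proof of Theorem 3(b).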


D Proof of Theorem 0 


Note that $\psi_{\mathrm{ml}}$ distinguishes the null hypothesis $x = x^* = \{x_i^*\}_{1\le i\le n}$ from the alternative hypothesis $x = w = \{w_i\}_{1\le i\le n}$ only based on those components $(i,j)$ where

$$x_i^* - x_j^* \ne w_i - w_j,$$

and its recovery capability depends only on the distinction of output distributions over these locations. For ease of presentation, we will suppose in the rest of the proof that both the ground truth and the null hypothesis are $x = 0$, but note that the proof carries over to all other ground truth values.

Let us divide the set of all alternative hypotheses into several classes $\mathcal{A}_k$ so that for each $k \ge 1$,

$$\mathcal{A}_k := \{w \ne 0 : |\mathcal{E} \cap \mathrm{supp}(w \ominus w)| < k\cdot\mathrm{mincut}\}, \tag{112}$$

where we employ the notation

$$\mathcal{E} \cap \mathrm{supp}(w \ominus w) := \{(i,j) \in \mathcal{E} \mid w_i - w_j \ne 0,\ i > j\}.$$

Apparently, any cut set cannot contain more than $n^2$ edges, and hence $\mathcal{A}_k = \emptyset$ for any $k > n^2/\mathrm{mincut}$. For any $w \in \mathcal{A}_k$, if we let $\mathcal{S}_i$ represent the set of vertices taking the value $i$, then by definition of $\mathcal{A}_k$ one has

$$\sum_{i=0}^{M-1} e(\mathcal{S}_i, \mathcal{S}_i^c) = 2\,|\mathcal{E} \cap \mathrm{supp}(w \ominus w)| < 2k\cdot\mathrm{mincut}. \tag{113}$$

On the other hand, consider the case where $k = 1$. All $w \in \mathcal{A}_1$ are equivalent to $0$ up to some global offset. This is because for any non-trivial cut $(\mathcal{S}_i, \mathcal{S}_i^c)$, one must have $|\mathcal{E} \cap \mathrm{supp}(w \ominus w)| \ge e(\mathcal{S}_i, \mathcal{S}_i^c) \ge \mathrm{mincut}$, which violates the feasibility constraint $|\mathcal{E} \cap \mathrm{supp}(w \ominus w)| < \mathrm{mincut}$. In the following lemma, we link the cardinality of each hypothesis class $\mathcal{A}_k$ with the cut-homogeneity exponent $\tau^{\mathrm{cut}}$ defined in (39).

Lemma 3. For any $k \le n^2/\mathrm{mincut}$, the hypothesis class $\mathcal{A}_k$ defined in (112) satisfies

$$\frac{\log|\mathcal{A}_k|}{k} \le 2\log M + 2\log(2k\cdot\mathrm{mincut}) + 4\tau^{\mathrm{cut}} \le 2\log M + 4\log(2n) + 4\tau^{\mathrm{cut}}. \tag{114}$$

Proof. See Appendix E. □


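We are now in position to characterize the recovery ability of $\psi_{\mathrm{ml}}$. Let $\mathbb{P}_w(\cdot)$ denote the measure given $x = w$. For any $0 < \alpha < 1$, it follows from (89) that

$$\mathbb{P}_0\left\{\log\frac{d\mathbb{P}_w(y)}{d\mathbb{P}_0(y)} > 0\right\} \le \exp\Big\{-(1-\alpha)\,\frac{\sum_{i=0}^{M-1} e(\mathcal{S}_i, \mathcal{S}_i^c)}{2}\,D_\alpha^{\min}\Big\}. \tag{115}$$

When restricted to the hypotheses in $\mathcal{A}_k\backslash\mathcal{A}_{k-1}$ for any $2 \le k \le n^2/\mathrm{mincut}$, we know from the definition of $\mathcal{A}_k$ that

$$(k-1)\,\mathrm{mincut} \le |\mathcal{E} \cap \mathrm{supp}(w \ominus w)| = \frac{\sum_{i=0}^{M-1} e(\mathcal{S}_i, \mathcal{S}_i^c)}{2} < k\,\mathrm{mincut}.$$

It then follows from the union bound that

$$\begin{aligned}
\mathbb{P}_0\left\{\exists\, w \in \mathcal{A}_k\backslash\mathcal{A}_{k-1} : \log\frac{d\mathbb{P}_w(y)}{d\mathbb{P}_0(y)} > 0\right\}
&\le |\mathcal{A}_k|\exp\Big(-(1-\alpha)\,\frac{\sum_{i=0}^{M-1} e(\mathcal{S}_i, \mathcal{S}_i^c)}{2}\,D_\alpha^{\min}\Big) \\
&\le \exp\Big\{-(k-1)\Big[(1-\alpha)D_\alpha^{\min}\cdot\mathrm{mincut} - \frac{k}{k-1}\cdot\frac{\log|\mathcal{A}_k|}{k}\Big]\Big\} \\
&\le \exp\Big\{-(k-1)\Big((1-\alpha)D_\alpha^{\min}\cdot\mathrm{mincut} - \frac{2\log|\mathcal{A}_k|}{k}\Big)\Big\} \qquad (116) \\
&\le \exp\big\{-(k-1)\big((1-\alpha)D_\alpha^{\min}\cdot\mathrm{mincut} - (4\log M + 8\log(2n) + 8\tau^{\mathrm{cut}})\big)\big\}, \qquad (117)
\end{aligned}$$

where (117) results from Lemma 3. This suggests that under the condition

$$(1-\alpha)\,D_\alpha^{\min}\cdot\mathrm{mincut} \ge (\delta + 8)\log(2n) + 8\tau^{\mathrm{cut}} + 4\log M$$

for some $0 < \alpha < 1$, one achieves

$$\begin{aligned}
P_{\mathrm{e}}(\psi_{\mathrm{ml}}) &\le \sum_{k=2}^{n^2/\mathrm{mincut}} \mathbb{P}_0\left\{\exists\, w \in \mathcal{A}_k\backslash\mathcal{A}_{k-1} : \log\frac{d\mathbb{P}_w(y)}{d\mathbb{P}_0(y)} > 0\right\} \\
&\le \sum_{k\ge 2}\exp\big\{-(k-1)\big((1-\alpha)D_\alpha^{\min}\cdot\mathrm{mincut} - (4\log M + 8\log(2n) + 8\tau^{\mathrm{cut}})\big)\big\} \\
&\le \sum_{k\ge 1}\exp(-k\cdot\delta\log(2n)) = \frac{(2n)^{-\delta}}{1 - (2n)^{-\delta}} = \frac{1}{(2n)^{\delta} - 1}.
\end{aligned}$$

To finish up, recognizing that $D_\alpha^{\min} \ge \mathsf{Hel}_\alpha^{\min}$ immediately establishes the recovery condition based on $\mathsf{Hel}_\alpha^{\min}$.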


E Proof of Lemma Q] 

(1) Define the cut-edge degree of a vertex v to be the number of edges in £ (S,S C ) that v is incident to. 
Consider any cut (<S,<S C ) with size 

e (<S, S c ) < k • mincut. (118) 

We shall separate all vertices into two types as follows: 

• Type-1 vertex : any vertex whose cut-edge degree is at least • mincut; 

• Type-2 vertex : any vertex whose cut-edge degree is less than \np • mincut. 

For ease of presentation, we will color all vertices in S black and all vertices in S c white; each feasible coloring 
scheme thus corresponds to one valid cut (S,S C ) in Af (k • mincut). 

To develop some intuitive understanding of the above notions, we depict in Fig. 5 an example of a cut $(\mathcal{S}, \mathcal{S}^c)$ in a geometric graph, where $\mathcal{S}^c$ consists of all vertices residing within the shaded area, and the blue solid edges indicate the cut edges. Typically, type-1 vertices, which are incident to many cut edges, are lying on or close to the boundary of the cut. In Fig. 5, these correspond to those vertices lying around the boundary of the shaded area in addition to those singleton white vertices. In contrast, type-2 vertices often refer to those staying away from the cut boundary (e.g. those white nodes in the center of the shaded area). It may be useful to keep this figure in mind when reading the subsequent proof.

To prove Lemma 3, we start by examining how many combinations of type-1 vertices are feasible and how many ways there are to color them. By definition, for any cut obeying (118), the number $V_1$ of type-1


vertices satisfies V\ < ^ (note that each edge is incident to two vertices and might be counted 

twice). Simple combinatorial arguments thus suggest that there are at most n 4k ^ p distinct ways to pick 
Figure 5: An example of the cut $(\mathcal{S}, \mathcal{S}^c)$ in a geometric graph. Here, $\mathcal{S}$ consists of all black vertices, while $\mathcal{S}^c$ contains all white vertices. The blue solid edges represent the cut edges.


these type-1 vertices, and then no more than 2 Vl < 2 Ak ^ p ways to color all these type-1 vertices, and finally 
at most ( 2e ^’‘ S < (2k • mincut)^ 1 < (2k • mincut) 4/c ^ p different combinations of cut-edge degrees among 

them. Taken together these counting arguments imply that there exist no more thaij^] 

n Ak ^ p (2k • mincut) 4/c/ " p 2 4 ^ < (2 n) 8k/Kp 


distinct ways to select the set of type-1 vertices as well as assign colors and cut-edge degrees for each of 
them, if one is required to satisfy the cut size constraint (118). 

We claim that for any cut (S,S C ) obeying (118), once the following three pieces of information are 
gathered: 

(i) which vertices are type-1 vertices, 

(ii) the cut-edge degrees of these type-1 vertices, 


(iii) the colors of these type-1 vertices (i.e. whether they belong to S or S c ), 

then the colors of all remaining vertices (and hence all information about this cut) can be uniquely deter¬ 
mined. Following the preceding pictorial interpretation, the whole point of this claim is to demonstrate that 
as long as some appropriate conditions regarding the cut boundary is known, then one can figure out all 
remaining cut information. To establish this claim, we shall consider the following two cases separately. 
Without loss of generality, the following discussion concentrates only on black type-1 vertices. 


• Case 1. Consider any vertex $v$ whose color has been revealed to be black, and whose cut-edge degree
does not exceed

$$ \left(1 - \tfrac{1}{2}\kappa\right)\rho\cdot\mathrm{mincut}, \tag{119} $$

namely, $v$ is connected with no more than $\left(1 - \tfrac{1}{2}\kappa\right)\rho\cdot\mathrm{mincut}$ white vertices. For any of its neighbors
$u$ (i.e. $(u,v)\in\mathcal{E}$), if the color of $u$ has not been revealed, then we claim that it must be black. To
see this, suppose instead that $u$ is white. Then from the above connectivity assumption (119) on $v$, the
number of black vertices that $u$ is linked with is at least


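$$ |\mathcal{V}(u)\cap\mathcal{V}(v)| - \left(1 - \tfrac{1}{2}\kappa\right)\rho\cdot\mathrm{mincut} \;\ge\; \rho\cdot\mathrm{mincut} - \left(1 - \tfrac{1}{2}\kappa\right)\rho\cdot\mathrm{mincut} \;=\; \tfrac{1}{2}\kappa\rho\cdot\mathrm{mincut}, $$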


17 Here, we use the fact that $2k\cdot\mathrm{mincut} < n^2$, and hence $(2k\cdot\mathrm{mincut})^{4k/(\kappa\rho)} < n^{8k/(\kappa\rho)}$.
where the inequality follows from Assumption (41). This means that $u$ must be a type-1 vertex (cf. the
definition of type-1 vertices) and its color must have been revealed, resulting in a contradiction. In
summary, all neighbors of $v$ with unknown colors are necessarily black.


• Case 2. Consider any vertex $v$ whose color has been revealed to be black, and whose cut-edge degree
is known to be larger than $\left(1 - \tfrac{1}{2}\kappa\right)\rho\cdot\mathrm{mincut}$. Again, consider any of its neighbors $u$ whose color
remains unknown, which must be incident to fewer than $\tfrac{1}{2}\kappa\rho\cdot\mathrm{mincut}$ cut edges since by construction
it is a type-2 vertex. This already suggests the following fact: if there are at least $\tfrac{1}{2}\kappa\rho\cdot\mathrm{mincut}$ vertices
falling in $\mathcal{V}(u)\cap\mathcal{V}(v)$ known to be white (resp. black), then the color of $u$ must be white (resp. black),
since by definition a type-2 vertex cannot be connected to $\tfrac{1}{2}\kappa\rho\cdot\mathrm{mincut}$ vertices of opposite color. As
a result, we can uniquely determine the color of $u$ unless

(P1) the colors of fewer than $\kappa\rho\cdot\mathrm{mincut}$ vertices¹⁸ in $\mathcal{V}(u)\cap\mathcal{V}(v)$ have been revealed.


This remaining situation is the subject of the discussion below.


Suppose that the true color of $u$ is black. Recall that $u$ is a type-2 vertex and hence it is connected
to fewer than $\tfrac{1}{2}\kappa\rho\cdot\mathrm{mincut}$ white vertices. From Assumption (41) and the condition $\kappa < 1/2$, any white neighbor
$w$ of $u$ must be connected with at least

$$ |\mathcal{V}(u)\cap\mathcal{V}(w)| - \tfrac{1}{2}\kappa\rho\cdot\mathrm{mincut} \;\ge\; \left(1 - \tfrac{1}{2}\kappa\right)\rho\cdot\mathrm{mincut} \;>\; \tfrac{1}{2}\kappa\rho\cdot\mathrm{mincut} $$

black vertices falling within $\mathcal{V}(u)\cap\mathcal{V}(w)$, and hence $w$ must be a type-1 vertex and its color has
necessarily been identified. Similarly, if $u$ is white, then the colors of all black vertices surrounding $u$
must have been revealed. As a result, all vertices in $\mathcal{V}(u)$ with unknown colors must be of the same
color as $u$. That being said, as long as one can identify the color of one extra vertex in $\mathcal{V}(u)\cap\mathcal{V}(v)$,
then the color of $u$ and all remaining vertices in $\mathcal{V}(u)\cap\mathcal{V}(v)$ can be uniquely determined.


Now let $w$ be the uncolored vertex in $\mathcal{V}(u)\cap\mathcal{V}(v)$ that is nearest to $v$, which by (P1) must
be within the $\kappa\rho\cdot\mathrm{mincut}$ closest vertices to $v$ in $\mathcal{V}(u)\cap\mathcal{V}(v)$. From Assumption (42), we see that $w$
must be connected to all but $\tfrac{1}{2}\rho\cdot\mathrm{mincut}$ neighbors surrounding $v$ and, as a result, be connected to at
least

$$ \text{cut-degree}(v) - |\mathcal{V}(v)\setminus\mathcal{V}(w)| \;>\; \left(1 - \tfrac{1}{2}\kappa\right)\rho\cdot\mathrm{mincut} - \tfrac{1}{2}\rho\cdot\mathrm{mincut} \;=\; \tfrac{1}{2}(1-\kappa)\rho\cdot\mathrm{mincut} \;>\; \tfrac{1}{2}\kappa\rho\cdot\mathrm{mincut} $$

white vertices since $\kappa < 1/2$, where cut-degree$(v)$ represents the cut-edge degree of $v$. Therefore, if $w$ is
black, then it has to be a type-1 vertex, which is contradictory; we have thus determined $w$ to be white.


Putting the above two cases together indicates that all vertices that are connected to the set of type-1
vertices can be uniquely colored, and we shall use $\mathcal{V}_{\mathrm{new}}$ to denote them. If there still exist uncolored vertices,
a nonempty subset of them must be connected to $\mathcal{V}_{\mathrm{new}}$. Since all vertices in $\mathcal{V}_{\mathrm{new}}$ are type-2 vertices and
have cut-degrees not exceeding $\tfrac{1}{2}\kappa\rho\cdot\mathrm{mincut} < \left(1 - \tfrac{1}{2}\kappa\right)\rho\cdot\mathrm{mincut}$, repeating the arguments in Case 1 allows us
to determine the color of all vertices surrounding $\mathcal{V}_{\mathrm{new}}$. This step further shrinks the size of the uncolored
set. Repeating this argument until all vertices are colored, we establish the claim. All in all, we have thus
demonstrated that the number of feasible coloring schemes is bounded above by $(2n)^{8k/(\kappa\rho)}$, which in turn
justifies

$$ \tau_k^{\mathrm{cut}} \le \frac{8\log(2n)}{\kappa\rho}, \qquad \forall k \ge 1. $$


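(2) If $\mathcal{G}$ is an expander graph with edge expansion $h_{\mathcal{G}}$, then for any vertex set $\mathcal{S}$ with $|\mathcal{S}| \le n/2$ one has

$$ |\mathcal{S}| \le e(\mathcal{S},\mathcal{S}^c)/h_{\mathcal{G}} \tag{120} $$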


18 Otherwise there are either $\tfrac{1}{2}\kappa\rho\cdot\mathrm{mincut}$ white vertices or $\tfrac{1}{2}\kappa\rho\cdot\mathrm{mincut}$ black vertices in $\mathcal{V}(u)\cap\mathcal{V}(v)$ with their colors revealed.
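from the definition of $h_{\mathcal{G}}$. For any $d > 0$, if one requires that

$$ e(\mathcal{S},\mathcal{S}^c) \le kd, \tag{121} $$

then the above inequality leads to

$$ |\mathcal{S}| \le kd/h_{\mathcal{G}}, $$

indicating that there are at most $2\binom{n}{kd/h_{\mathcal{G}}} \le 2n^{kd/h_{\mathcal{G}}}$ feasible cuts $(\mathcal{S},\mathcal{S}^c)$ satisfying (121). Setting $d = \mathrm{mincut}$
immediately leads to

$$ |\mathcal{N}(k\cdot\mathrm{mincut})| \le 2n^{k\cdot\mathrm{mincut}/h_{\mathcal{G}}}, $$

and hence

$$ \tau_k^{\mathrm{cut}} = \frac{1}{k}\log|\mathcal{N}(k\cdot\mathrm{mincut})| \le \frac{\mathrm{mincut}\cdot\log n}{h_{\mathcal{G}}} + \frac{\log 2}{k}, \qquad \forall k \ge 1, $$

as claimed.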
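As a sanity check of the counting step in part (2), the following minimal sketch brute-forces the edge expansion and the number of small cuts on a toy graph and compares the count with $2n^{k\cdot\mathrm{mincut}/h_{\mathcal{G}}}$. The 8-vertex cycle, the helper names, and the choice $k=1$ are illustrative assumptions, not taken from the paper; the cycle is not a good expander, but the bound only uses the definition of $h_{\mathcal{G}}$. The count treats $\mathcal{S}$ and $\mathcal{S}^c$ as distinct sets, which the leading factor 2 accommodates.

```python
from itertools import combinations

def cut_size(edges, subset):
    """Number of edges with exactly one endpoint in `subset`, i.e. e(S, S^c)."""
    s = set(subset)
    return sum((u in s) != (v in s) for u, v in edges)

def edge_expansion(n, edges):
    """Brute-force h_G: min of e(S, S^c)/|S| over nonempty S with |S| <= n/2."""
    return min(
        cut_size(edges, S) / len(S)
        for size in range(1, n // 2 + 1)
        for S in combinations(range(n), size)
    )

def count_small_cuts(n, edges, budget):
    """Count nonempty proper vertex sets S with e(S, S^c) <= budget."""
    return sum(
        1
        for size in range(1, n)
        for S in combinations(range(n), size)
        if cut_size(edges, S) <= budget
    )

# Hypothetical toy instance: the cycle on 8 vertices.
n = 8
edges = [(i, (i + 1) % n) for i in range(n)]

h = edge_expansion(n, edges)
mincut = min(
    cut_size(edges, S) for size in range(1, n) for S in combinations(range(n), size)
)
k = 1
num_cuts = count_small_cuts(n, edges, k * mincut)
bound = 2 * n ** (k * mincut / h)

print(h, mincut, num_cuts, bound)
```

On this instance the only sets with cut size at most mincut are contiguous arcs, so the count is far below the bound; the gap illustrates that the estimate is a worst-case counting argument rather than a tight enumeration.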


F Proof of Lemma 2


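We begin with explicit expressions of the divergence measures. For any $k \neq l$ and $p \in [0,1]$, one has

$$
\begin{aligned}
\mathsf{KL}\left(p\delta_k + (1-p)\mathrm{Unif}_M \,\|\, p\delta_l + (1-p)\mathrm{Unif}_M\right)
&= \left(p + \frac{1-p}{M}\right)\log\frac{p + \frac{1-p}{M}}{\frac{1-p}{M}} + \frac{1-p}{M}\log\frac{\frac{1-p}{M}}{p + \frac{1-p}{M}} && (122) \\
&= p\log\frac{(M-1)p + 1}{1-p}, && (123)
\end{aligned}
$$

where $\delta_k$ denotes the Dirac measure on the point $k$, and (122) follows since the two distributions under study
differ only at two points $x = k$ and $x = l$. Similarly, one obtains (cf. the definition of the Hellinger divergence)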


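$$ \mathsf{Hel}_{1/2}\left(p\delta_k + (1-p)\mathrm{Unif}_M \,\|\, p\delta_l + (1-p)\mathrm{Unif}_M\right) = 2\left(\sqrt{p + \frac{1-p}{M}} - \sqrt{\frac{1-p}{M}}\right)^2. \tag{124} $$

When applied to the outlier model, these suggest

$$ \mathsf{KL}^{\min} = p_{\mathrm{true}}\log\left(1 + \frac{p_{\mathrm{true}}M}{1-p_{\mathrm{true}}}\right) \le \frac{p_{\mathrm{true}}^2 M}{1-p_{\mathrm{true}}} \tag{125} $$

and

$$ \mathsf{Hel}_{1/2}^{\min} = \frac{2}{M}\left(\sqrt{1 - p_{\mathrm{true}} + Mp_{\mathrm{true}}} - \sqrt{1 - p_{\mathrm{true}}}\right)^2. \tag{126} $$

It remains to control the Hellinger divergence. To this end, the elementary identity $a - b = \frac{a^2 - b^2}{a + b}$
gives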


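$$ \left(\sqrt{1 - p_{\mathrm{true}} + Mp_{\mathrm{true}}} - \sqrt{1 - p_{\mathrm{true}}}\right)^2 = \left(\frac{p_{\mathrm{true}}M}{\sqrt{1 - p_{\mathrm{true}} + Mp_{\mathrm{true}}} + \sqrt{1 - p_{\mathrm{true}}}}\right)^2 \ge \left(\frac{p_{\mathrm{true}}M}{2\sqrt{1 - p_{\mathrm{true}} + Mp_{\mathrm{true}}}}\right)^2 = \frac{p_{\mathrm{true}}^2M^2}{4\left(1 - p_{\mathrm{true}} + Mp_{\mathrm{true}}\right)}, $$

indicating that $\mathsf{Hel}_{1/2}^{\min} \ge \frac{p_{\mathrm{true}}^2 M}{2\left(1 - p_{\mathrm{true}} + Mp_{\mathrm{true}}\right)}$, as claimed.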
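The closed-form expressions in this proof can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names and the tested values of $p$ and $M$ are hypothetical choices, and the Hellinger divergence is taken to be $\sum_x (\sqrt{P(x)} - \sqrt{Q(x)})^2$, i.e. the $f$-divergence generated by $(\sqrt{x}-1)^2$. It builds the two mixtures $p\delta_k + (1-p)\mathrm{Unif}_M$ and $p\delta_l + (1-p)\mathrm{Unif}_M$ explicitly and compares direct summation with the closed forms and with the lower bound on the Hellinger divergence.

```python
import math

def outlier_dist(center, p, M):
    """Mass function of p*delta_center + (1-p)*Unif_M on {0, ..., M-1}."""
    return [(p if x == center else 0.0) + (1.0 - p) / M for x in range(M)]

def kl(P, Q):
    """KL divergence between two mass functions with full support."""
    return sum(a * math.log(a / b) for a, b in zip(P, Q) if a > 0)

def hellinger_half(P, Q):
    """Sum of (sqrt(P) - sqrt(Q))^2 (assumed normalization of Hel_{1/2})."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(P, Q))

def check(p, M):
    # Two mixtures centered at different points k = 0 and l = 1.
    P, Q = outlier_dist(0, p, M), outlier_dist(1, p, M)
    kl_closed = p * math.log(1 + p * M / (1 - p))
    hel_closed = (2 / M) * (math.sqrt(1 - p + M * p) - math.sqrt(1 - p)) ** 2
    hel_lower = p ** 2 * M / (2 * (1 - p + M * p))
    return kl(P, Q), kl_closed, hellinger_half(P, Q), hel_closed, hel_lower

# Hypothetical parameter grid; p is kept strictly inside (0, 1).
results = {(p, M): check(p, M) for p in (0.05, 0.3, 0.7) for M in (2, 5, 50)}

for (p, M), (kl_num, kl_cf, hel_num, hel_cf, hel_lb) in results.items():
    print(p, M, kl_num, kl_cf, hel_num, hel_cf, hel_lb)
```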
G Proof of Lemma 3

Consider any hypothesis $x = w \in \mathcal{A}_k$, which obeys $|\mathcal{E}\cap\mathrm{supp}(w\ominus w)| \le k\cdot\mathrm{mincut}$. Denote by $\mathcal{S}_l$ the set
of vertices that take the value $l$ ($0 \le l < M$), and let $\mathcal{I} := \{l \mid \mathcal{S}_l \neq \emptyset\}$ represent the indices of the
non-empty ones. Our proof proceeds by evaluating the following quantities:

1. How many distinct choices of $\mathcal{I}$ are admissible?

2. For each given $\mathcal{I}$, how many combinations of cut-set sizes $\{e(\mathcal{S}_l,\mathcal{S}_l^c) \mid l \in \mathcal{I}\}$ are feasible?

3. For each given cut-set size $N_l$, how many cuts $(\mathcal{S}_l,\mathcal{S}_l^c)$ are compatible with the constraint $e(\mathcal{S}_l,\mathcal{S}_l^c) \le N_l$?

Clearly, multiplying all these quantities together gives an upper bound on $|\mathcal{A}_k|$.
We now compute the above quantities separately.

• To begin with, our assumption on the min-cut size ensures that

$$ e(\mathcal{S}_l,\mathcal{S}_l^c) \ge \mathrm{mincut} \tag{127} $$

for each non-empty $\mathcal{S}_l$. This together with the feasibility constraint

$$ 2\,|\mathcal{E}\cap\mathrm{supp}(w\ominus w)| = \sum_{l=0}^{M-1} e(\mathcal{S}_l,\mathcal{S}_l^c) \le 2k\cdot\mathrm{mincut} \tag{128} $$

guarantees that the number of non-empty $\mathcal{S}_l$'s cannot exceed $2k$. Consequently, there exist at most
$M^{2k}$ possible combinations of $\mathcal{I}$.

• Secondly, from (128), the total cut-set size is bounded above by $2k\cdot\mathrm{mincut}$. Therefore, for any given
$\mathcal{I}$, there are no more than

$$ \binom{2k\cdot\mathrm{mincut}}{|\mathcal{I}|} \le (2k\cdot\mathrm{mincut})^{2k} $$

feasible ways to assign cut-set sizes $e(\mathcal{S}_l,\mathcal{S}_l^c)$ for all $l \in \mathcal{I}$.

• Thirdly, suppose that for each $l \in \mathcal{I}$,

$$ e(\mathcal{S}_l,\mathcal{S}_l^c) = c_l\cdot\mathrm{mincut} \tag{129} $$

for some numerical values $c_l \ge 1$. From the definition (39), the number of feasible choices of $(\mathcal{S}_l,\mathcal{S}_l^c)$
compatible with (129) is bounded above by

$$ |\mathcal{N}(c_l\cdot\mathrm{mincut})| \le |\mathcal{N}(\lceil c_l\rceil\cdot\mathrm{mincut})| \le \exp\left(\lceil c_l\rceil\,\tau^{\mathrm{cut}}\right) \le \exp\left(2c_l\,\tau^{\mathrm{cut}}\right). $$

Recognize that the constraint (128) requires

$$ \sum_l c_l \le 2k. $$

As a result, when the cut sizes $e(\mathcal{S}_l,\mathcal{S}_l^c)$ are given, the total number of valid partitions $\{\mathcal{S}_l \mid 0 \le l < M\}$
cannot exceed

$$ \prod_{l=0}^{M-1}\exp\left(2c_l\,\tau^{\mathrm{cut}}\right) \le \exp\left(4k\,\tau^{\mathrm{cut}}\right). \tag{130} $$

Putting the above combinatorial bounds together implies that

$$ |\mathcal{A}_k| \le M^{2k}\,(2k\cdot\mathrm{mincut})^{2k}\exp\left(4k\,\tau^{\mathrm{cut}}\right). $$

Using the inequality $k\cdot\mathrm{mincut} < n^2$, we conclude the proof.
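The double-counting identity (128) and the resulting cap on the number of non-empty label classes can be illustrated on a small example. The sketch below is a hedged illustration only: the circulant graph, the labeling, and all function names are hypothetical choices made for this check. It verifies that every edge whose endpoints carry different labels is counted exactly twice in $\sum_l e(\mathcal{S}_l,\mathcal{S}_l^c)$, and that the number of non-empty classes times mincut does not exceed that sum.

```python
def crossing_edges(edges, labels):
    """|E ∩ supp(w ⊖ w)|: number of edges whose endpoints have different labels."""
    return sum(labels[u] != labels[v] for u, v in edges)

def class_cut_sizes(edges, labels, M):
    """e(S_l, S_l^c) for each label l, where S_l = {i : labels[i] == l}."""
    sizes = [0] * M
    for u, v in edges:
        if labels[u] != labels[v]:
            sizes[labels[u]] += 1
            sizes[labels[v]] += 1
    return sizes

def brute_force_mincut(n, edges):
    """Smallest e(S, S^c) over all nonempty proper vertex subsets S."""
    best = len(edges)
    for mask in range(1, 2 ** n - 1):
        cut = sum(((mask >> u) & 1) != ((mask >> v) & 1) for u, v in edges)
        best = min(best, cut)
    return best

# Hypothetical toy instance: circulant graph on 12 vertices with offsets 1 and 3,
# and an arbitrary labeling using three of the M = 4 available values.
n, M = 12, 4
edges = [(i, (i + d) % n) for i in range(n) for d in (1, 3)]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 0, 0]

sizes = class_cut_sizes(edges, labels, M)
lhs = 2 * crossing_edges(edges, labels)   # left-hand side of (128)
rhs = sum(sizes)                          # sum of e(S_l, S_l^c) over all l
mincut = brute_force_mincut(n, edges)
nonempty = len(set(labels))               # number of non-empty classes

print(lhs, rhs, mincut, nonempty)
```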







H Proof of Fact Q] 


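Recall that the KL divergence and the Hellinger divergence are both $f$-divergences associated with the non-negative
convex functions $f_1(x) = x\log x - x + 1$ and $f_2(x) = (\sqrt{x} - 1)^2$, respectively. That said, one can write

$$ \mathsf{KL}(P\,\|\,Q) = \mathbb{E}_Q\left[f_1\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right)\right] \quad\text{and}\quad \mathsf{Hel}_{1/2}(P\,\|\,Q) = \mathbb{E}_Q\left[f_2\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right)\right]. $$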

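One can verify that the function $f_1$ can be uniformly bounded from both sides using $f_2$ in the following way:

$$ \left(2 - 0.5\,|\log x|\right) f_2(x) \le f_1(x) \le \left(2 + |\log x|\right) f_2(x), \qquad \forall x > 0. $$

This immediately establishes that

$$ \mathsf{KL}(P\,\|\,Q) = \mathbb{E}_Q\left[f_1\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right)\right] \le (2 + \log R)\,\mathbb{E}_Q\left[f_2\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right)\right] = (2 + \log R)\,\mathsf{Hel}_{1/2}(P\,\|\,Q) $$

and

$$ \mathsf{KL}(P\,\|\,Q) = \mathbb{E}_Q\left[f_1\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right)\right] \ge (2 - 0.5\log R)\,\mathbb{E}_Q\left[f_2\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right)\right] = (2 - 0.5\log R)\,\mathsf{Hel}_{1/2}(P\,\|\,Q). $$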


These together with the well-known inequality [57, Lemma 2.4]

$$ \mathsf{KL}(P\,\|\,Q) \ge \mathsf{Hel}_{1/2}(P\,\|\,Q) $$

establish (19).

Similarly, from the inequality

$$ \left(2 - 0.4\,|\log x|\right) f_2(x) \le f_1(x) \le \left(2 + 0.4\,|\log x|\right) f_2(x), \qquad \forall x \in (0, 4.5], $$

one can show that

$$ \max\{2 - 0.4\log R,\, 1\}\cdot\mathsf{Hel}_{1/2}(P\,\|\,Q) \le \mathsf{KL}(P\,\|\,Q) \le (2 + 0.4\log R)\cdot\mathsf{Hel}_{1/2}(P\,\|\,Q) \tag{131} $$

as long as $R \le 4.5$, as claimed.
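The two pointwise inequalities between $f_1$ and $f_2$ are stated without proof; a quick numerical scan offers some reassurance. The following sketch is a grid check, not a proof: the grids, the tolerance `tol`, and the helper `holds` are assumptions introduced here. It evaluates both sandwich bounds on a logarithmic grid over $[10^{-6}, 10^{6}]$ and on a uniform grid over $(0, 4.5]$.

```python
import math

def f1(x):
    """Generator of the KL divergence: x log x - x + 1."""
    return x * math.log(x) - x + 1.0

def f2(x):
    """Generator of the Hellinger divergence: (sqrt(x) - 1)^2."""
    return (math.sqrt(x) - 1.0) ** 2

def holds(x, lo_coef, hi_coef, tol=1e-12):
    """Check (2 - lo*|log x|) f2 <= f1 <= (2 + hi*|log x|) f2 up to rounding slack."""
    a = abs(math.log(x))
    lower_ok = (2 - lo_coef * a) * f2(x) <= f1(x) + tol
    upper_ok = f1(x) <= (2 + hi_coef * a) * f2(x) + tol
    return lower_ok and upper_ok

# Hypothetical evaluation grids.
wide_grid = [10 ** (e / 50) for e in range(-300, 301)]    # x from 1e-6 to 1e6
narrow_grid = [4.5 * j / 2000 for j in range(1, 2001)]    # x in (0, 4.5]

all_wide = all(holds(x, 0.5, 1.0) for x in wide_grid)      # bound valid for all x > 0
all_narrow = all(holds(x, 0.4, 0.4) for x in narrow_grid)  # sharper bound on (0, 4.5]

print(all_wide, all_narrow)
```

The upper bound with coefficient 0.4 is nearly tight at the right endpoint $x = 4.5$, which is consistent with the restriction to $(0, 4.5]$ in the statement.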